Combine several google-pprof files (pprof CPU profiler) - profiling

I want to use the CPU profiler from google-perftools (gperftools' libprofiler.so), which is described here:
http://gperftools.googlecode.com/svn/trunk/doc/cpuprofile.html
In my setup I want to run the program to be profiled several times (up to 1500; each run is different) and then combine the pprof outputs from all runs (or from some subset of runs) into a single pprof file.
How can I do this?
PS: My program uses almost no shared libraries, so only a single binary (ELF) file will be analyzed.
PPS: Thanks to Chris, pprof can take several profiles at once:
pprof ./program first.pprof.out second.pprof.out ...
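For reference, the whole workflow could then look roughly like this, assuming the binary is linked with -lprofiler so that the CPUPROFILE environment variable picks the output file for each run (the run_$i names and the 1500-run loop are just placeholders for however the runs are actually driven):
for i in $(seq 1 1500); do
    CPUPROFILE=run_$i.pprof.out ./program    # each run writes its own profile file
done
pprof --text ./program run_1.pprof.out run_2.pprof.out run_3.pprof.out
Passing any subset of the profile files to a single pprof invocation aggregates them into one report; --text prints a flat listing, and the other output modes (e.g. --gv or --callgrind) take the same list of files.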

Related

QtCreator performance tuning

I use QtCreator whenever I can although its performance is not great sometimes.
I have a feeling situation gets worse with large source files, to put a number here I'll say over 1000 lines.
It seems disabling a couple of Helper plugins makes it take less CPU.
Is there a way to know CPU usage by each plugin? Which plugins are the most CPU hungry?
Now I'm going with the following, and CPU usage seems good (close to 1% almost all the time).
You can disable the ClangCodeModel plugin and Cppcheck to reduce CPU usage, but most of the processing is done by a background parser that tokenizes your source files and reads their symbols. Sometimes a third-party library contains a huge number of files and makes Qt Creator slow. You can also reduce the set of files that have to be parsed via the "Clang Code Model" panel (Tools > Options > C++ > Code Model).

Callgrind: How to use Callgrind tool to evaluate function speed

I am interested in testing the speed of some function calls in code written in C/C++. I searched and I was directed to use the Valgrind platform with the Callgrind tool.
I have briefly read the manual, but I am still wondering how I can use the tool to evaluate the runtime speed of my functions.
I was wondering if I could get some pointers how I can achieve my goal.
Any help would be appreciated.
Compile your program with debug symbols (GDB symbols work fine; they are enabled with the "-ggdb" flag).
If you are executing your program like this:
./program
Then run it with Valgrind+Callgrind with this command:
valgrind --tool=callgrind ./program
Callgrind will then produce a file called callgrind.out.1234 (1234 is the process ID and will probably be different when you run). Open this file with:
callgrind_annotate callgrind.out.1234
You may want to use grep to extract your function name. The left column shows the number of instructions spent in each function. Functions that use a comparatively small number of instructions are left out of the listing, though.
If you want to see the output with some nice graphics, I recommend installing KCachegrind.
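If you only care about a few specific calls, Callgrind's client requests can restrict collection to exactly those regions. Below is a minimal sketch; compute() is just a stand-in for whatever function you want to measure, and it assumes the Valgrind headers are installed:
#include <valgrind/callgrind.h>

/* Stand-in for the function whose cost we want to measure. */
static long compute(long n)
{
    long sum = 0;
    for (long i = 0; i < n; ++i)
        sum += i * i;
    return sum;
}

int main(void)
{
    CALLGRIND_TOGGLE_COLLECT;      /* start collecting just before the call of interest */
    long result = compute(1000000);
    CALLGRIND_TOGGLE_COLLECT;      /* stop collecting right after it */
    return (int)(result & 1);      /* keep the result live */
}
Build and run it with collection disabled at startup, so only the toggled region is counted:
gcc -ggdb example.c -o example
valgrind --tool=callgrind --collect-atstart=no ./example
callgrind_annotate callgrind.out.<pid>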

G++ Compilation Statistics

I was wondering if there is any way to gather statistics from GCC/G++ compilation process. Metrics like the number of lines compiled in the entire process, total time spent compiling, number of errors/warnings, number/size of compiled objects and so on.
I would like to write a script (maybe in Python) to generate statistical information on a daily, weekly and monthly basis.
Any ideas?
Thanks
I know of one: it's called CDash, and it's part of a larger suite that also includes CMake, CTest and CPack.
This will probably be an interesting video for you
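If you just want raw numbers you can feed into such a script, GCC itself can tell you where compilation time goes. A rough sketch (big_file.cpp stands for any translation unit you care about):
g++ -c -ftime-report big_file.cpp 2> time_report.txt
/usr/bin/time -v g++ -c big_file.cpp
The first command makes the compiler print a per-pass time report on stderr; the second one (GNU time) reports overall wall time, CPU time and peak memory for the compile. Both are easy to parse and accumulate per day, week or month.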

Profiling for wall-time on Linux

I have an application that I want to profile with respect to how much time is spent in various activities. Since this application is I/O intensive, I want to get a report that summarizes how much time is spent in every library/system call (wall time).
I've tried oprofile, but it seems to give time in terms of unhalted CPU cycles (that's CPU time, not real time).
I've tried strace -T, which gives wall time, but the data generated is huge and getting a summary report is difficult (do awk/python scripts exist for this?).
Now I'm looking at SystemTap, but I can't find any script that is close enough to what I need and could be modified, and the on-site tutorial didn't help much either. I am not sure if what I am looking for can even be done.
I need someone to point me in the right direction.
Thanks a lot!
Judging from this commit, the recently released strace 4.9 supports this with:
strace -w -c
They call it "syscall latency" (and it's hard to tell from the manpage alone that this is what -w does).
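A typical invocation might look like this (-f and -o are just the usual options to follow child processes and write the summary to a file; summary.txt is a placeholder name):
strace -w -c -f -o summary.txt ./program
The resulting table lists, per syscall, the total wall-clock time, the percentage of time, the number of calls and the number of errors, which is exactly the kind of summary report asked for.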
Are you doing this just out of measurement curiosity, or because you want to find time-drains that you can fix to make it run faster?
If your goal is to make it run as fast as possible, then try random-pausing.
It doesn't measure anything, except very roughly.
It may be counter-intuitive, but what it does is pinpoint the code that will result in the greatest speed-up.
See the fntimes.stp systemtap sample script. https://sourceware.org/systemtap/examples/index.html#profiling/fntimes.stp
The fntimes.stp script monitors the execution time history of a given function family (assumed non-recursive). Each time (beyond a warmup interval) is then compared to the historical maximum. If it exceeds a certain threshold (250%), a message is printed.
# stap fntimes.stp 'kernel.function("sys_*")'
or
# stap fntimes.stp 'process("/path/to/your/binary").function("*")'
The last line of that .stp script demonstrates the way to track time consumed in a given family of functions:
probe $1.return { elapsed = gettimeofday_us() - @entry(gettimeofday_us()) }

How to measure the amount of data transmitted by my MPI program?

I'm experimenting with my distributed clustering algorithm (implemented with MPI) on 24 computers that I set up as a cluster using BCCD (Bootable Cluster CD), which can be downloaded at http://bccd.net/.
I've written a batch program to run my experiment, which consists of running my algorithm several times while varying the number of nodes and the size of the input data.
I want to know the amount of data used in the MPI communications for each run of my algorithm, so I can see how the amount of data changes when varying the previously mentioned parameters. And I want to do all this automatically using a batch program.
Someone told me to use tcpdump, but I found some difficulties in this approach.
First, I don't know how to call tcpdump in my batch program (which is written in C++ and launches commands with system()) before each run of my algorithm, since tcpdump requires another terminal to run in parallel with my application. And I can't run tcpdump on another computer, since the network uses a switch. So I need to run it on the master node.
Second, I watched the traffic with tcpdump while my experiment was running and I couldn't figure out which port MPI uses. It seems to use many ports. I wanted to know that so I could filter the packets.
Third, I tried capturing whole packets and saving them to a file using tcpdump, and within a few seconds the file was 3.5 MB. But my whole experiment takes 2 days, so the final log file would be huge if I followed this approach.
The ideal approach would be to capture just the size field in the header of each packet and sum these up to obtain the total amount of data transmitted. That way the log file would be much smaller than if I were capturing whole packets. But I don't know how to do it.
Another restriction is that I don't have access to the computer's disk. So I only have the RAM and my 4GB USB flash drive. So I can't have huge log files.
I have already thought about using some MPI tracing or profiling tool such as those mentioned at http://www.open-mpi.org/faq/?category=perftools. I have only tested Sun Performance Analyzer so far. The problem is that I guess it will be difficult, maybe even impossible, to install those tools on BCCD. In addition, such a tool will make my experiment take longer to finish, since it adds overhead. But if someone is familiar with BCCD and thinks it is a good idea to use one of those tools, please let me know.
I hope someone has a solution.
Approaches like tcpdump won't work anyway if there are multi-core nodes that use shared memory to communicate.
Using something like MPE is almost certainly the way to go. Those tools add very little overhead, and some overhead is always going to be necessary if you want to count messages. You can use mpitrace to write out every MPI call, and parse the resulting text file yourself. By the way, note that MPE is explicitly discussed on the bccd website. MPICH2 comes with MPE built in, but it can be compiled for any implementation. I've only found a very modest overhead for MPE.
IPM is another nice tool that counts messages and their sizes; you should be able either to parse the XML output or to use the post-processing tools and just manually integrate the graphs (say, either bytes_rx/bytes_tx by rank, or the message buffer size/count graph). The overhead for IPM is even lower than for MPE, and mostly comes after the program has finished running, to do the file I/O.
If you were really worried about the overhead of either of these approaches, you could always write your own MPI wrappers using the profiling interface, wrapping MPI_Send, MPI_Recv, etc., just counting the number of bytes sent and received by each process, and output only that total at the end.
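To make that last suggestion concrete, here is a rough sketch of such a wrapper built on the PMPI profiling interface. It only counts bytes handed to MPI_Send (the same pattern applies to MPI_Recv, MPI_Isend and friends), and it assumes MPI-3 headers; on older MPI-2 headers the const on buf has to be dropped. Compile it into an object file and link it with the application, with no changes to the application source:
#include <mpi.h>
#include <stdio.h>

/* Running total of bytes this rank has handed to MPI_Send. */
static long long bytes_sent = 0;

/* Our MPI_Send shadows the library's version: it does the bookkeeping
   and then forwards the call to PMPI_Send.
   (MPI-3 signature; drop the const on buf for MPI-2 headers.) */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    PMPI_Type_size(datatype, &type_size);
    bytes_sent += (long long)count * type_size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

/* Report the per-rank total just before MPI shuts down. */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d sent %lld bytes via MPI_Send\n", rank, bytes_sent);
    return PMPI_Finalize();
}
Each rank then prints a single line at the end, which a batch script can collect and sum, so the logging footprint stays tiny.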