How to use callgrind to profile only a certain period of program execution?

I want to use valgrind to do some profiling, since it does not require rebuilding the program (the program I want to profile is already built with "-g").
But valgrind (callgrind) is quite slow ... so here's what I want to do:
start the server ( I want to profile that server)
kind of attach to that server
before I do some operation on the server, start collecting profile data
after the operation is done, stop collecting profile data
analyze the profiling data.
I can do this kind of thing using Sun Studio on Solaris (using dbx). I just want to know whether it is possible to do the same thing using valgrind (callgrind).
Thanks

You should look at the callgrind documentation, and read about callgrind_control.
Launch your app: valgrind --tool=callgrind --instr-atstart=no your_server.x
"Kind of attach": see 1. (the server is launched under valgrind, so there is no separate attach step)
start collecting profile data: callgrind_control -i on
stop collecting profile data: callgrind_control -i off
Analyze data with kcachegrind or callgrind_annotate/cg_annotate

For profiling only certain functions, you may also find CALLGRIND_START_INSTRUMENTATION and CALLGRIND_STOP_INSTRUMENTATION from the <valgrind/callgrind.h> header useful, together with callgrind's --instr-atstart=no option as suggested in Doomsday's answer.
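A minimal sketch of that approach (do_heavy_work is a hypothetical stand-in for whatever you actually want to measure):

#include <valgrind/callgrind.h>

void do_heavy_work() {                // hypothetical placeholder for the real work
    volatile long sum = 0;
    for (long i = 0; i < 10000000; ++i) sum += i;
}

int main() {
    // Started as: valgrind --tool=callgrind --instr-atstart=no ./app
    // Everything before this point runs without instrumentation.
    CALLGRIND_START_INSTRUMENTATION;  // begin instrumenting here
    do_heavy_work();
    CALLGRIND_STOP_INSTRUMENTATION;   // stop instrumenting again
    return 0;
}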

You don't say what OS - I'm assuming Linux - in which case you might want to look at oprofile (free) or Zoom (not free, but you can get an evaluation licence), both of which are sampling profilers and can profile existing code without re-compilation. Zoom is much nicer and easier to use (it has a GUI and some nice additional features), but you probably already have oprofile on your system.

Related

Attaching callgrind/valgrind to a program midway through its execution

I use Valgrind to detect issues by running a program from the beginning.
Now I have memory/performance issues at a very specific moment in the program. Unfortunately there is no feasible way to take a shortcut to that point from the start.
Is there a way to instrument a C++ program (with Valgrind/Callgrind) midway through its execution, for example by attaching to the process?
Already answered here:
How to use callgrind to profile only a certain period of program execution?
There is no way to use valgrind on an already running program.
For callgrind, you can however somewhat speed it up by only recording data later during execution.
For this, you might e.g. look at callgrind options
--instr-atstart=no|yes Do instrumentation at callgrind start [yes]
--collect-atstart=no|yes Collect at process/thread start [yes]
--toggle-collect=<func> Toggle collection on enter/leave function
You can also control such aspects from within your program.
Refer to valgrind/callgrind user manual for more details.
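For example, to collect only while inside one particular function, something like this should work (handle_request is just a hypothetical name; it has to match the function name as callgrind sees it):
valgrind --tool=callgrind --collect-atstart=no --toggle-collect=handle_request ./your_server.x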
There are two things that Callgrind does that slow down execution.
Counting operations (the collect part)
Simulating the cache and branch predictor (the instrumentation part). This depends on the --cache-sim and --branch-sim options, which both default to "no". If you are using these options and you disable instrumentation then I expect that there will be some impact on the accuracy of the modelling as the cache and predictor won't be "warm" when they are toggled.
There are a few other approaches that you could use.
Use the client request mechanisms. This would require you to include a Valgrind header and add a few lines that use the Valgrind macros CALLGRIND_START_INSTRUMENTATION/CALLGRIND_STOP_INSTRUMENTATION and CALLGRIND_TOGGLE_COLLECT to start and stop instrumentation/collection. See the manual for details. Then just run your application under Valgrind with --instr-atstart=no --collect-atstart=no
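A rough sketch of what those few lines might look like (region_of_interest is a hypothetical placeholder; the exact placement depends on your code):

#include <valgrind/callgrind.h>

void region_of_interest() {           // hypothetical stand-in for the real code
    volatile long sum = 0;
    for (long i = 0; i < 10000000; ++i) sum += i;
}

int main() {
    // Started as: valgrind --tool=callgrind --instr-atstart=no --collect-atstart=no ./app
    CALLGRIND_START_INSTRUMENTATION;  // turn on the (expensive) instrumentation
    CALLGRIND_TOGGLE_COLLECT;         // turn on event collection
    region_of_interest();
    CALLGRIND_TOGGLE_COLLECT;         // turn collection off again
    CALLGRIND_STOP_INSTRUMENTATION;
    return 0;
}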
Use the gdb monitor commands. In this case you would have two terminals. In the first you would run Valgrind with --instr-atstart=no --collect-atstart=no --vgdb=yes. In the second terminal run gdb yourapplication then from the gdb prompt (gdb) target remote | vgdb. Then you can use the monitor commands as described in the manual, which include collect and instrumentation control.
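Roughly, such a session might look like this (a sketch only; the monitor commands shown, instrumentation/dump/status, are the ones documented in the callgrind manual, and here collection is left at its default so that enabling instrumentation is enough):

Terminal 1:
valgrind --tool=callgrind --instr-atstart=no --vgdb=yes ./yourapplication

Terminal 2:
gdb ./yourapplication
(gdb) target remote | vgdb
(gdb) monitor instrumentation on
(gdb) continue
...interrupt with Ctrl-C when the interesting phase is over...
(gdb) monitor instrumentation off
(gdb) monitor dump
(gdb) monitor status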
callgrind_control, part of the Valgrind distribution. To be honest I've never used this.
I recently did some profiling using the first technique without cache/branch sim. When I used Callgrind on the whole run I saw a 23x runtime increase compared to running outside of Callgrind. When I profiled only the one function that I wanted to analyze, this fell to about a 5x slowdown. Obviously this will vary greatly from case to case.
Thanks all for the help.
It seems this has already been answered here: How to use callgrind to profile only a certain period of program execution?
but was not marked as answered for some reason.
Summary:
Starting callgrind with instrumentation off
valgrind --tool=callgrind --instr-atstart=no <PROG>
Controlling instrumentation (can be executed in another shell)
callgrind_control -i on/off

How to develop a debugger

I am looking for a way to do things such as attach to a process, set breakpoints, view memory, and other things that gdb/lldb can do. I cannot, however, find a way to do these things.
This question is similar to this one, but for MacOS instead of Windows. Any help is appreciated!
Note: I want to make a debugger, not use one.
Another thing is that I don't want this debugger to be super complicated; all I need is reading/writing memory, breakpoint handling, and viewing the general-purpose registers.
If you really want to make your own debugger, another way to start would be to figure out how to construct and parse the gdb-remote protocol packets (e.g. https://sourceware.org/gdb/onlinedocs/gdb/Remote-Protocol.html). That's the protocol gdb uses when remote debugging, and the one lldb uses for everything but Windows debugging. On macOS, lldb spawns a debugserver instance which does the actual debugging and controls it with gdb-remote protocol packets. On Linux, it uses the lldb-server tool that is part of the Linux lldb distribution for the same purpose.
The gdb-remote protocol has primitives for most of the operations you want to perform (launch a process, attach to a process, set breakpoints, read memory and registers), and it isolates you from a lot of the low-level details of controlling processes.
You can help yourself out by observing how lldb uses this protocol by running an lldb debug session with:
(lldb) log enable gdb-remote packets
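For a flavour of what goes over the wire, a few of the core remote-protocol packets look roughly like this (each packet is framed as $<payload>#<two-hex-digit checksum>; checksums are shown as xx here):

$qSupported:<features>#xx   negotiate features with the stub
$vAttach;<pid>#xx           attach to an existing process
$g#xx                       read the general-purpose registers
$m<addr>,<len>#xx           read <len> bytes of memory at <addr>
$Z0,<addr>,<kind>#xx        insert a software breakpoint at <addr>
$c#xx                       continue execution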
But you might also have a look at the SB APIs in lldb. The documentation is not as advanced as it should be, but there are a bunch of examples in the examples/python directory of the lldb sources to get you started, and in general the APIs are pretty straightforward and self-explanatory.
LLDB has an API that can be consumed from C++ or Python. Maybe this is what you’re looking for.
Unfortunately the documentation is fairly threadbare, and there don’t seem to be usage examples. This will therefore entail some reading of the source and a lot of trial and error.
If you want to write your own debugger, you'll need to obtain a task port to the process (task_for_pid); then you can read/write/iterate over its virtual memory (mach_vm_read, mach_vm_write, mach_vm_region). To do breakpoints, you need to first set up an exception handler; then you can manipulate the debug registers on threads (task_threads, thread_get_state, thread_set_state) and handle the breakpoints in the exception handler.
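An untested sketch of the first step only (attach and read one word of memory; the exact headers, privileges, and entitlements needed for task_for_pid vary by macOS version, so treat this as an outline rather than working code):

#include <mach/mach.h>
#include <mach/mach_traps.h>   // task_for_pid
#include <mach/mach_vm.h>
#include <sys/types.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv) {
    if (argc < 3) { fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]); return 1; }
    pid_t pid = (pid_t)atoi(argv[1]);
    mach_vm_address_t addr = strtoull(argv[2], nullptr, 16);

    mach_port_t task = MACH_PORT_NULL;
    kern_return_t kr = task_for_pid(mach_task_self(), pid, &task);   // needs privileges/entitlements
    if (kr != KERN_SUCCESS) { fprintf(stderr, "task_for_pid failed: %d\n", kr); return 1; }

    vm_offset_t data = 0;
    mach_msg_type_number_t count = 0;
    kr = mach_vm_read(task, addr, sizeof(uint64_t), &data, &count);  // read 8 bytes of target memory
    if (kr != KERN_SUCCESS) { fprintf(stderr, "mach_vm_read failed: %d\n", kr); return 1; }

    printf("read %u bytes; first word: 0x%llx\n", count,
           (unsigned long long)*(const uint64_t *)data);
    return 0;
}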
For reference, here is some not-entirely-correct debugger code I've written, since breakpoints (especially instruction breakpoints) are a bit involved.
MacDBG may be another lightweight reference but I haven't used it myself.
Well, if you want to write a debugger, take a look at the gdb/lldb source code. I'd suggest the latter, due to historical legacy in gdb that might cloud whatever is actually going on.
Use a debugger, such as gdb or lldb. They have plenty of documentation to teach you the "how" part, but for example gdb -p <pid of process> will attach gdb to a running process.
If you want to drive gdb from a C++ program, then launch it in a separate process (see fork and exec) with the arguments it needs (probably including the one to enable its machine-parsable interface). Make sure you set up pipes to its stdin/stdout so you can read its output and send it commands.
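A rough sketch of that approach, with error handling mostly omitted (gdb's MI interface is enabled with --interpreter=mi2; the MI commands sent here are just an illustration):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    int to_gdb[2], from_gdb[2];
    pipe(to_gdb);                      // parent writes commands here
    pipe(from_gdb);                    // parent reads gdb output here

    pid_t child = fork();
    if (child == 0) {
        dup2(to_gdb[0], STDIN_FILENO);
        dup2(from_gdb[1], STDOUT_FILENO);
        close(to_gdb[1]); close(from_gdb[0]);
        execlp("gdb", "gdb", "--interpreter=mi2", "-p", argv[1], (char *)nullptr);
        _exit(127);                    // exec failed
    }
    close(to_gdb[0]); close(from_gdb[1]);

    // Ask for a backtrace, then quit.
    const char *cmds = "-stack-list-frames\n-gdb-exit\n";
    write(to_gdb[1], cmds, strlen(cmds));

    char buf[4096];
    ssize_t n;
    while ((n = read(from_gdb[0], buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);   // dump MI output; real code would parse it

    waitpid(child, nullptr, 0);
    return 0;
}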
If instead you want to write your own debugger from scratch then that is a huge undertaking. I suggest starting by reading the source of an existing open source debugger.
Whilst you could look at source code of another debugger, you may find it difficult to understand without the knowledge of the underlying concepts. Therefore, I recommend you start by obtaining a good grounding of the concepts with the following resources:
Mac OS X Sys Internals
Rather outdated now, but the original bible for the internals of Mac
Mac OS X and iOS Internals
Again, outdated but useful. This link is Jonathan Levin's (the author's) own site and he's now providing it for free, due to issues he had with the publisher. He's since purchased back the rights, making it available to all.
*OS Internals
The current bible of Mac Internals, also by Jonathan Levin. Books III and I have been published, with book II to follow shortly!

Minimizing the size of debugging information for testing at a remote location

I am trying to create a way to transfer the debug information of a C++ project to a remote location for testing. In the current development cycle, small changes to the code require the entire binary (100s MB in size and mostly debug info) to be transferred.
Currently my approach to addressing this is to split the debugging information from the object files (the size of which, without the debugging info, is manageable over my connection) using -gsplit-dwarf, and then to diff the debug files against a copy of the build currently on the remote box.
The aim is to have a set of patches for the debug files of a project so that new code can be debugged at a remote location. The connection between the remote location and the local machine is slow, so minimizing the size of the patches is paramount, but this should also be balanced against the run time of the tool. I have looked into bsdiff and xdelta as potential solutions and have run into a conundrum: xdelta is fast but its patches are too large, while bsdiff is perfect in terms of size but its run time and memory requirements are a little higher than I would like.
Is there a tool or approach I am missing, or am I just going about this the wrong way? Some alternative to bsdiff and xdelta perhaps? I know that a tool like gdbserver won't work in this situation because of some of the requirements we have with the actual debugging. Could some alteration of bsdiff help the performance? And indeed, if the approach I'm using is sound, what would be a good way to keep a copy of the build on the remote machine around to diff against?
The simplest way is to use "strip" to copy the debuginfo into a separate ".debug" file, and then use "strip" again to remove the debug info from the executable that you will deploy. The "strip" manual explains how to do this, look for the "--only-keep-debug" option.
After you do this, you can tell gdb about the separate debug info in various ways. The very best way is to use the "build-id" feature. This is what modern Linux distros do. However there are other ways as well. There's a whole section in the gdb manual about separate debug files.
The key point here is that you can start gdb on the stripped executable and it will find the separate debug info automatically. This data can all be local, so you won't need to deploy the debug info.
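The recipe from the binutils documentation looks roughly like this (strip's --only-keep-debug works similarly; your_prog here is whatever binary you deploy):

objcopy --only-keep-debug your_prog your_prog.debug
objcopy --strip-debug your_prog
objcopy --add-gnu-debuglink=your_prog.debug your_prog

After this, gdb your_prog will locate your_prog.debug through the debug link (or through the build-id, if you use that mechanism instead).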
If you still care about shrinking debug info even when this is done, you can look at the "dwz" tool. This is a DWARF compressor. However this usually only matters if you plan to ship the debug info somewhere -- distros use it to make it easier to download debug info, but ordinary users won't really see the need.

How to write a simple debugger?

I would like to connect my compiled object code to the C++ code, and then check whether a certain line of code was executed.
How to do those two things?
If the explanation is not simple (I bet it is not), can someone at least point to some web pages explaining how to do this?
I understand that the solution is different for different platforms, but I am interested in how it is done on Windows and Linux (Linux for a start).
If you want to know how it is done:
gdb is open source
the ptrace syscall should get you started (see the sketch after this list)
libunwind-ptrace
this is a nice article using ptrace
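A minimal ptrace sketch of the attach-and-peek idea (Linux only; on many distros you also need to relax the yama ptrace_scope setting or run as root):

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <cerrno>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv) {
    if (argc < 3) { fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]); return 1; }
    pid_t pid = (pid_t)atoi(argv[1]);
    unsigned long addr = strtoul(argv[2], nullptr, 16);

    if (ptrace(PTRACE_ATTACH, pid, nullptr, nullptr) == -1) { perror("PTRACE_ATTACH"); return 1; }
    waitpid(pid, nullptr, 0);                           // wait until the target stops

    errno = 0;
    long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, nullptr);
    if (errno != 0) perror("PTRACE_PEEKDATA");
    else printf("word at 0x%lx: 0x%lx\n", addr, word);

    ptrace(PTRACE_DETACH, pid, nullptr, nullptr);       // let the target run again
    return 0;
}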
I suspect that you don't really need a debugger but a profiler. I like callgrind (http://valgrind.org/docs/manual/cl-manual.html), which has a nice graphical front end, KCachegrind (http://kcachegrind.sourceforge.net/).
To try it, I'd use
$ valgrind --tool=callgrind ./myapp
$ kcachegrind callgrind.out.xxx
In your comment you say "I would just like to gather information on how to check which methods/functions are executed during the execution, and how many times".
If that is what you want to achieve, then use a profiler such as gprof.
Compile your program with -g -pg and when your program finishes it will create a file which can be processed by gprof to show you what you want.
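For example (myapp.cpp stands in for your own sources):

g++ -g -pg -o myapp myapp.cpp
./myapp                      (writes gmon.out in the current directory when it exits)
gprof ./myapp gmon.out > report.txt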

Linux time sample based profiler

short version:
Is there a good time based sampling profiler for Linux?
long version:
I generally use OProfile to optimize my applications. I recently found a shortcoming that has me wondering.
The problem was a tight loop spawning c++filt to demangle a C++ name. I only stumbled upon the code by accident while chasing down another bottleneck. OProfile didn't show anything unusual about the code, so I almost ignored it, but my code sense told me to optimize the call and see what happened. I changed the popen of c++filt to abi::__cxa_demangle. The runtime went from more than a minute to a little over a second. About a 60x speedup.
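For reference, the in-process replacement looks roughly like this (a minimal sketch; the mangled name is just an illustrative example):

#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main() {
    int status = 0;
    // Demangle in-process instead of spawning c++filt through popen().
    char *name = abi::__cxa_demangle("_Z3foov", nullptr, nullptr, &status);
    if (status == 0 && name != nullptr) {
        printf("%s\n", name);   // prints: foo()
        free(name);             // the result is malloc'd; the caller must free it
    }
    return 0;
}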
Is there a way I could have configured OProfile to flag the popen call? As the profile data sits now, OProfile thinks the bottleneck was the heap and std::string calls (which, by the way, once optimized dropped the runtime to less than a second, more than a 2x speedup).
Here is my OProfile configuration:
$ sudo opcontrol --status
Daemon not running
Event 0: CPU_CLK_UNHALTED:90000:0:1:1
Separate options: library
vmlinux file: none
Image filter: /path/to/executable
Call-graph depth: 7
Buffer size: 65536
Is there another profiler for Linux that could have found the bottleneck?
I suspect the issue is that OProfile only logs its samples to the currently running process. I'd like it to always log its samples to the process I'm profiling. So if the process is currently switched out (blocking on IO or a popen call) OProfile would just place its sample at the blocked call.
If I can't fix this, OProfile will only be useful when the executable is pushing near 100% CPU. It can't help with executables that have inefficient blocking calls.
Glad you asked. I believe OProfile can be made to do what I consider the right thing, which is to take stack samples on wall-clock time when the program is being slow and, if it won't let you examine individual stack samples, at least summarize for each line of code that appears on samples, the percent of samples the line appears on. That is a direct measure of what would be saved if that line were not there. Here's one discussion. Here's another, and another. And, as Paul said, Zoom should do it.
If your time went from 60 sec to 1 sec, that implies every single stack sample would have had a 59/60 probability of showing you the problem.
Try Zoom - I believe it will let you profile all processes - it would be interesting to know if it highlights your problem in this case.
I wrote this a long time ago, only because I couldn't find anything better: https://github.com/dicej/profile
I just found this, too, though I haven't tried it: https://github.com/oliver/ptrace-sampler
A quickly hacked-up, trivial sampling profiler for Linux: http://vi-server.org/vi/simple_sampling_profiler.html
It appends backtrace(3) to a file on SIGUSR1, and then converts it to annotated source.
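The general idea looks something like this (a minimal sketch; note that backtrace_symbols_fd is not formally async-signal-safe, so this is a works-in-practice hack rather than robust code):

#include <execinfo.h>
#include <csignal>
#include <fcntl.h>
#include <unistd.h>

static void dump_backtrace(int) {
    void *frames[64];
    int n = backtrace(frames, 64);                       // capture the current stack
    int fd = open("samples.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd >= 0) {
        backtrace_symbols_fd(frames, n, fd);             // append symbolized frames
        write(fd, "----\n", 5);                          // separator between samples
        close(fd);
    }
}

int main() {
    signal(SIGUSR1, dump_backtrace);
    // ... the real program would do its work here; an external loop can now
    // send SIGUSR1 at regular intervals to take samples ...
    for (;;) pause();
}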
After trying everything suggested here (except for the now-defunct Zoom, which is still available as a huge file from Dropbox), I found that NOTHING does what Mr. Dunlavey recommends. The "quick hacks" listed above in some of the answers wouldn't build for me, or didn't work for me either. Spent all day trying stuff... and nothing could find fseek as a hotspot in an otherwise simple test program that was I/O bound.
So I coded up yet another profiler, this time with no build dependencies, based on GDB, so it should "just work" for almost any debuggable code. A single CPP file.
https://github.com/jasonrohrer/wallClockProfiler
It automates the manual process suggested by Mr. Dunlavey, interrupting the target process with GDB periodically and harvesting a stack trace, and then printing a report at the end about which stack traces are the most common. Those are your true wall-clock hotspots. And it actually works.
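If you want to try the underlying manual technique yourself before reaching for a tool, a crude version is just a shell loop around gdb (the pid and the one-second interval are only illustrative):

while true; do
  gdb -p <pid> -batch -ex "thread apply all bt" 2>/dev/null
  sleep 1
done

Whichever line of code shows up in most of the collected backtraces is where the wall-clock time is going.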