Is VisualVM instrumenting bytecode?

I'm a little confused: AFAIK VisualVM performs profiling and sampling, so does that mean it not only takes dumps (thread stacks + memory state) but also instruments the code?
This answer: https://stackoverflow.com/a/12130149/10894456 explains that profiling implies instrumenting. But does VisualVM do the instrumenting by itself, or does it need something prepared in advance (like a Java agent)?

Yes, when you use the Profiler, VisualVM instruments the bytecode as necessary. This can only be done via an agent, so VisualVM ships with such a Java agent. When you are connected to a JVM on the same machine, it can use the Attach API to load the agent into the target JVM dynamically, so in this use case no additional preparation is needed on the user's side.
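As a rough sketch of what that dynamic attach looks like (this is not VisualVM's actual code; the PID and agent path are placeholders):

import com.sun.tools.attach.VirtualMachine;

public class AttachSketch {
    public static void main(String[] args) throws Exception {
        // Attach to the target JVM by process id (placeholder PID)
        VirtualMachine vm = VirtualMachine.attach("12345");
        try {
            // Load the agent jar into the running JVM; this invokes
            // the agent's agentmain method in the target process
            vm.loadAgent("/path/to/profiler-agent.jar");
        } finally {
            vm.detach();
        }
    }
}

On JDK 9+ this API lives in the jdk.attach module; on JDK 8 it requires tools.jar on the classpath.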

Related

Attaching Callgrind/Valgrind to a program midway through its execution

I use Valgrind to detect issues by running a program from the beginning.
Now I have memory/performance issues at a very specific moment in the program. Unfortunately, there is no feasible way to take a shortcut to that point from the start.
Is there a way to instrument the C++ program with Valgrind/Callgrind midway through its execution, like attaching to the process?
Already answered here:
How use callgrind to profiling only a certain period of program execution?
There is no way to use valgrind on an already running program.
For callgrind, however, you can somewhat speed things up by only recording data later during the execution.
For this, you might look at the following callgrind options, for example:
--instr-atstart=no|yes Do instrumentation at callgrind start [yes]
--collect-atstart=no|yes Collect at process/thread start [yes]
--toggle-collect=<func> Toggle collection on enter/leave function
You can also control such aspects from within your program.
Refer to valgrind/callgrind user manual for more details.
There are two things that Callgrind does that slow down execution.
Counting operations (the collect part)
Simulating the cache and branch predictor (the instrumentation part). This depends on the --cache-sim and --branch-sim options, which both default to "no". If you are using these options and you disable instrumentation, then I expect there will be some impact on the accuracy of the modelling, as the cache and predictor won't be "warm" when they are toggled.
There are a few other approaches that you could use.
Use the client request mechanisms. This would require you to include a Valgrind header and add a few lines that use the Valgrind macros CALLGRIND_START_INSTRUMENTATION/CALLGRIND_STOP_INSTRUMENTATION and CALLGRIND_TOGGLE_COLLECT to start and stop instrumentation/collection. See the manual for details. Then just run your application under Valgrind with --instr-atstart=no --collect-atstart=no.
Use the gdb monitor commands. In this case you would have two terminals. In the first you would run Valgrind with --instr-atstart=no --collect-atstart=no --vgdb=yes. In the second terminal, run gdb on your application, then from the gdb prompt issue: target remote | vgdb. Then you can use the monitor commands as described in the manual, which include collection and instrumentation control; an example session is shown after this list.
callgrind_control, part of the Valgrind distribution. To be honest I've never used this.
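A hypothetical session using the gdb monitor approach (the program name is a placeholder; the monitor command names are from the Callgrind manual):

Terminal 1:
valgrind --tool=callgrind --instr-atstart=no --collect-atstart=no --vgdb=yes ./myprog

Terminal 2:
gdb ./myprog
(gdb) target remote | vgdb
(gdb) monitor instrumentation on
(gdb) continue
(interrupt with Ctrl-C once the interesting phase is over)
(gdb) monitor instrumentation off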
I recently did some profiling using the first technique without cache/branch sim. When I used Callgrind on the whole run I saw a 23x runtime increase compared to running outside of Callgrind. When I profiled only the one function that I wanted to analyze, this fell to about a 5x slowdown. Obviously this will vary greatly from case to case.
Thanks all for the help.
It seems this has already been answered here: How use callgrind to profiling only a certain period of program execution?
but was not marked as answered for some reason.
Summary:
Starting callgrind with instrumentation off
valgrind --tool=callgrind --instr-atstart=no <PROG>
Controlling instrumentation (can be executed from another shell)
callgrind_control -i on/off

How can you trace execution of an embedded system emulated in QEMU?

I've built OpenWrt for x86 and I'm using QEMU to run it virtually. I'm trying to debug this system in real time. I need to see things like network traffic flowing, etc.
I can attach gdb remotely and execute (mostly) step by step with breakpoints. I really want tracepoints, though; I don't want to pause execution and lose network flow. When I tried setting tracepoints using tstart, I see the message "Target does not support this command". I did a bit of reading of the gdb documentation, and from what I can tell, the gdb stub that intercepts normal execution in QEMU does not support tracepoints.
From here I started looking at other tools and ran across PANDA (https://github.com/panda-re/panda). As I understand it, PANDA will capture a complete system trace in a log and allow for replay. I think this tool is supposed to do what I need, but I cannot seem to replay the results. I see the logs, I just can't replay them.
Now, I'm a bit stuck on what other tools/options I might have to actually trace a running embedded system. Are there any good tools you can recommend? Or perhaps another method I've missed?
If you want to see the system calls and signals, use strace.
strace can also be attached to a running process, and it can write its output to a log file if required.
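For example (the PID and log file name are placeholders):

strace -f -p 1234 -o strace.log

Here -p attaches to an already running process, -f follows forked children, and -o writes the trace to a file instead of stderr.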
In OpenWrt it is possible to build with ftrace. Ftrace has much of the functionality I required, but not all.
To build with ftrace, the option for ftrace must be selected in the build menu. Additionally there are a variety of tracer options that must also be enabled.
The trace-cmd package (the ftrace front end) is located under menuconfig/Development.
Tracing support is under menuconfig/Global build settings/Compile the kernel with tracing support, and includes:
Trace system calls
Trace process context switches and events
Function tracer (Function graph tracer, Enable/disable function tracing dynamically, and Function profiler)
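Once the kernel is built with tracing support, a session with trace-cmd might look like this (a sketch; the tracer and PID are placeholders):

trace-cmd record -p function_graph -P 1234
trace-cmd report

record writes the captured events to trace.dat, and report prints them in readable form.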
I'm also planning to build a custom GDB stub to do this a little better, as I also want to see the data passed to the functions, not just the function calls.

Minimizing the size of debugging information for testing at a remote location

I am trying to create a way to transfer the debug information of a C++ project to a remote location for testing. In the current development cycle, small changes to the code require the entire binary (hundreds of MB in size, mostly debug info) to be transferred.
Currently my approach is to split the debugging information from the object files (whose size without the debug info is manageable over my connection) using -gsplit-dwarf, and then to diff the debug files against a copy of the build currently on the remote box.
The aim is to have a set of patches for the debug files of a project so that new code can be debugged at a remote location. The connection between the remote location and the local machine is slow, so minimizing the size of the patches is paramount, but this should also be balanced against the run time of the tool. I have looked into bsdiff and xdelta as potential solutions and have run into a conundrum: xdelta is fast but its patches are too large, while bsdiff is perfect in terms of size but its run time and memory requirements are higher than I would like.
Is there a tool or approach I am missing, or am I just going about this the wrong way? Some alternative to bsdiff and xdelta, perhaps? I know that a tool like gdbserver won't work in this situation because of some of the requirements we have for the actual debugging. Could some alteration of bsdiff help the performance? And if the approach I'm using is sound, what would be a good way to keep a copy of the build on the remote machine to diff against?
The simplest way is to use "strip" to copy the debug info into a separate ".debug" file, and then use "strip" again to remove the debug info from the executable that you will deploy. The "strip" manual explains how to do this; look for the "--only-keep-debug" option.
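A typical sequence looks something like this (a sketch using objcopy, which shares these options with strip; "prog" is a placeholder):

objcopy --only-keep-debug prog prog.debug
strip -g prog
objcopy --add-gnu-debuglink=prog.debug prog

The last step records the name and a checksum of the .debug file in the stripped executable so that gdb can locate the separate debug info.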
After you do this, you can tell gdb about the separate debug info in various ways. The very best way is to use the "build-id" feature. This is what modern Linux distros do. However there are other ways as well. There's a whole section in the gdb manual about separate debug files.
The key point here is that you can start gdb on the stripped executable and it will find the separate debug info automatically. This data can all be local, so you won't need to deploy the debug info.
If you still care about shrinking debug info even when this is done, you can look at the "dwz" tool. This is a DWARF compressor. However this usually only matters if you plan to ship the debug info somewhere -- distros use it to make it easier to download debug info, but ordinary users won't really see the need.
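For instance (hypothetical file name), dwz prog.debug rewrites the file in place with compressed DWARF; with its -m option, dwz can also deduplicate debug info shared by several files into a common file.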

What is the use of the agentmain method in Java instrumentation?

I have done some Java bytecode instrumentation with the -javaagent argument and the premain method, but this is the first time I have heard about the agentmain method. I have some questions about it:
Do premain and agentmain have the same use?
When is the agentmain method invoked?
What is the use of the agentmain method in Java instrumentation?
premain is invoked when an agent is started before the application. Agents invoked using premain are specified with the -javaagent switch.
agentmain is invoked when an agent is started after the application is already running. Agents started with agentmain can be attached programmatically using the Sun tools API (for Sun/Oracle JVMs only -- the method for introducing dynamic agents is implementation-dependent).
An agent can have both a premain and an agentmain, but only one of the two will be called in a particular JVM invocation. In other words, your agent will either start with premain or agentmain, but not both.
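A minimal sketch of an agent exposing both entry points (class and jar names are hypothetical; the jar's manifest must declare Premain-Class and Agent-Class for the respective mechanism to work):

import java.lang.instrument.Instrumentation;

public class MyAgent {
    // Runs before the application's main() when the JVM is started
    // with -javaagent:myagent.jar
    public static void premain(String agentArgs, Instrumentation inst) {
        System.out.println("premain: the application has not started yet");
    }

    // Runs when the agent is loaded into an already running JVM,
    // e.g. via the Attach API
    public static void agentmain(String agentArgs, Instrumentation inst) {
        System.out.println("agentmain: the application is already running");
    }
}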
You can find more information about this in the answer to the question Starting a Java agent after program start.

Why do malloc/new capture the callstack?

I have a 64-bit application that runs as a service under Server 2003.
When I attach the VS Profiler or WinDbg, I see lots of call stacks like the one below. I understand that processes spawned in the debugger (or profiler) use the debug heap, etc. But this is not the case here, since the service is started by the OS and I am only attaching to it.
I do not understand why it is unwinding the stack. And the profiler shows that a measurable amount of time is spent doing that. Some more info:
• These are release bits built with vc9, running on Server 2003.
• System environment variable _NO_DEBUG_HEAP is set to 1.
• I am using Microsoft symbol servers.
Why is it capturing the stack trace? It seems it's logging it, but I can't find where.
My objective is to verify that the app is really unwinding the stack and, if that is true, try to avoid it.
Any ideas?
Callstack
ntdll!RtlVirtualUnwind
ntdll!RtlpWalkFrameChain
ntdll!RtlCaptureStackBackTrace
ntdll!RtlpCaptureStackTraceForLogging
ntdll!RtlLogStackBackTrace
ntdll!RtlDebugAllocateHeap
ntdll!RtlAllocateHeapSlowly
ntdll!RtlAllocateHeap
MSVCR90!malloc
MSVCR90!operator new
It's possible that somebody has used gflags.exe to enable user stack trace capturing for this process or system-wide, or some other flag that requires tracking of CRT allocation operations.
You should be able to check this possibility in the registry using the info here.
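For example, the user-mode stack trace database can be toggled per image with gflags (the image name is a placeholder):

gflags /i myservice.exe +ust

and disabled again with gflags /i myservice.exe -ust. If +ust shows up for your image (or globally), that could explain the RtlLogStackBackTrace frames in the call stack above.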
Steve's answer is great, and eventually led to a resolution for our own server after a long investigation. Our situation was slightly different, and the cause of the stack captures was a script periodically running a UMDH capture. UMDH enables stack collection when executed against a process for the first time, and it gives the following warning:
UMDH has enabled allocation stack collection for the current running process.