How to associate events, metrics and source-level results for profiling a pyCUDA program using nvvp - profiling

When I try to profile my pyCUDA application using nvvp, it works for the most part. I can click on "Examine GPU Usage" and view a number of analysis results / suggestions for my code, such as "Low Compute / Memcpy Efficiency."
However, every time nvvp runs the program to perform an analysis, I see the following warning.
Some collected events, metrics or source-level results could not be associated with the session timeline. This may prevent event, metric and source-level results from being assigned to some kernels.
It looks like I might be able to get more detailed analysis if I do something to fix this. Does anyone know how to associate "collected events, metrics or source-level results with the session timeline"?

As stated in the profiler documentation:
The Visual Profiler cannot correctly import profiler data generated by nvprof when the option --kernels kernel-filter is used. Visual Profiler reports a warning, "Some collected events or source-level results could not be associated with the session timeline." One workaround is to use the nvprof option --kernels :::1 to profile the first invocation for all kernels.
So you can try changing that option.
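For example, if you export a profile with nvprof and import it into nvvp, the documented workaround looks roughly like this (a sketch; the output file name and the Python entry point are placeholders, and --analysis-metrics is only needed if you want data for the guided analysis):
nvprof --kernels :::1 --analysis-metrics -o myprofile.nvprof python my_pycuda_app.py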

Related

How to profile a specific code section in C++ code using ncu/CUPTI [duplicate]

I'm familiar with using nvprof to access the events and metrics of a benchmark, e.g.,
nvprof --system-profiling on --print-gpu-trace -o (file name) --events inst_issued1 ./benchmarkname
The
system-profiling on --print-gpu-trace -o (filename)
part of the command gives timestamps for kernel start and end times, power, and temperature, and saves the info to an nvvp file so we can view it in the Visual Profiler. This allows us to see what's happening in any section of the code, in particular when a specific kernel is running. My question is this:
Is there a way to isolate the events counted for only a section of the benchmark run, for example during a kernel execution? In the command above,
--events inst_issued1
just gives the instructions tallied for the whole executable. Thanks!
You may want to read the profiler documentation.
You can turn profiling on and off within an executable. The CUDA runtime API functions for this are:
cudaProfilerStart()
cudaProfilerStop()
So, if you wanted to collect profile information only for a specific kernel, you could do:
#include <cuda_profiler_api.h>
...
cudaProfilerStart();
myKernel<<<...>>>(...);
cudaProfilerStop();
(instead of a kernel call, the above could be a function or code that calls kernels)
Excerpting from the documentation:
When using the start and stop functions, you also need to instruct the profiling tool to disable profiling at the start of the application. For nvprof you do this with the --profile-from-start off flag. For the Visual Profiler you use the Start execution with profiling enabled checkbox in the Settings View.
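For example, with nvprof (a sketch; ./benchmarkname stands in for your application, which must contain the cudaProfilerStart/cudaProfilerStop calls shown above):
nvprof --profile-from-start off ./benchmarkname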
Also from the documentation for nvprof specifically, you can limit event/metric tabulation to a single kernel with a command line switch:
--kernels <kernel name>
The documentation gives additional usage possibilities.
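For instance, reusing the event from the question with a hypothetical kernel name myKernel (a sketch; the kernel filter also accepts the more elaborate context:stream:kernel:invocation form described in the documentation):
nvprof --kernels myKernel --events inst_issued1 ./benchmarkname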
The same methodologies are possible with Nsight Systems and Nsight Compute.
The CUDA profiler start/stop functions work exactly the same way. The Nsight Systems documentation explains how to run the profiler with capture control managed by the profiler API:
nsys [global-options] start -c cudaProfilerApi
or for Nsight Compute:
ncu [options] --profile-from-start off
Likewise, Nsight Compute can be configured via the command line to only profile specific kernels. The primary switch for this is -k to select kernels by name. In repetitive situations, the -c switch can be used to determine the number of the named kernel launches to profile, and the -s switch can be used to skip a number of launches before profiling.
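For example (a sketch; myKernel is a hypothetical kernel name):
ncu -k myKernel -s 2 -c 1 ./benchmarkname
This would skip the first two kernel launches and then profile a single launch matching myKernel; check ncu --help for the exact skip/count semantics in your version.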
These methodologies don't apply just to events and metrics, but to all profiling activity performed by the respective profilers.
The CUDA profiler API can be used in any executable, and does not require compilation with nvcc.
After looking into this a bit more, it turns out that kernel-level information is also given for all kernels (without using --kernels to name them specifically) by using
nvprof --events <event names> --metrics <metric names> ./<cuda benchmark>
In fact, it gives output of the form
"Device","Kernel","Invocations","Event Name","Min","Max","Avg"
If a kernel is called multiple times in the benchmark, this allows you to see the Min, Max, and Avg of the desired events for those kernel runs. Evidently the --kernels option in the CUDA 7.5 profiler allows each run of each kernel to be specified.

Attaching callgrind/valgrind to a program midway through its execution

I use Valgrind to detect issues by running a program from the beginning.
Now I have memory/performance issues at a very specific moment in the program. Unfortunately there is no feasible way to shortcut to this point from the start.
Is there a way to instrument the C++ program (Valgrind/Callgrind) midway through its execution, like attaching to the process?
Already answered here:
How to use callgrind to profile only a certain period of program execution?
There is no way to use valgrind on an already running program.
For callgrind, you can, however, speed it up somewhat by only recording data later during execution.
For this, you might, for example, look at the following callgrind options (combined in the sketch after the list):
--instr-atstart=no|yes Do instrumentation at callgrind start [yes]
--collect-atstart=no|yes Collect at process/thread start [yes]
--toggle-collect=<func> Toggle collection on enter/leave function
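A minimal sketch combining these (my_hot_function and ./your_server are placeholders; collection is switched on when the function is entered and off when it returns):
valgrind --tool=callgrind --collect-atstart=no --toggle-collect=my_hot_function ./your_server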
You can also control such aspects from within your program.
Refer to the valgrind/callgrind user manual for more details.
There are two things that Callgrind does that slow down execution.
Counting operations (the collect part)
Simulating the cache and branch predictor (the instrumentation part). This depends on the --cache-sim and --branch-sim options, which both default to "no". If you are using these options and you disable instrumentation then I expect that there will be some impact on the accuracy of the modelling as the cache and predictor won't be "warm" when they are toggled.
There are a few other approaches that you could use.
Use the client request mechanisms. This would require you to include a Valgrind header and add a few lines that use the Valgrind macros CALLGRIND_START_INSTRUMENTATION/CALLGRIND_STOP_INSTRUMENTATION and CALLGRIND_TOGGLE_COLLECT to start and stop instrumentation/collection (see the sketch after this list). See the manual for details. Then just run your application under Valgrind with --instr-atstart=no --collect-atstart=no
Use the gdb monitor commands. In this case you would have two terminals. In the first you would run Valgrind with --instr-atstart=no --collect-atstart=no --vgdb=yes. In the second terminal run gdb yourapplication, then from the gdb prompt run: target remote | vgdb. Then you can use the monitor commands as described in the manual, which include collect and instrumentation control (see the terminal sketch at the end of this answer).
callgrind_control, part of the Valgrind distribution. To be honest I've never used this.
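A minimal sketch of the client-request approach (the function name and the work inside it are placeholders; run the resulting binary under valgrind --tool=callgrind --instr-atstart=no --collect-atstart=no):
#include <valgrind/callgrind.h>

// Hypothetical stand-in for the code you actually want to profile.
static void expensive_operation() {
    volatile long sum = 0;
    for (long i = 0; i < 100000000L; ++i) sum += i;
}

int main() {
    // Startup work before this point is neither instrumented nor collected.
    CALLGRIND_START_INSTRUMENTATION;  // enable callgrind's instrumentation
    CALLGRIND_TOGGLE_COLLECT;         // start counting events
    expensive_operation();
    CALLGRIND_TOGGLE_COLLECT;         // stop counting events
    CALLGRIND_STOP_INSTRUMENTATION;   // disable instrumentation again
    return 0;
}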
I recently did some profiling using the first technique without cache/branch sim. When I used Callgrind on the whole run I saw a 23x runtime increase compared to running outside of Callgrind. When I profiled only the one function that I wanted to analyze, this fell to about a 5x slowdown. Obviously this will vary greatly from case to case.
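And a terminal sketch of the gdb monitor approach described above (./your_server is a placeholder; this variant toggles only instrumentation and leaves collection at its default, and the available monitor commands should be checked against the Callgrind manual for your Valgrind version):
Terminal 1:
valgrind --tool=callgrind --instr-atstart=no --vgdb=yes ./your_server
Terminal 2:
gdb ./your_server
(gdb) target remote | vgdb
(gdb) monitor instrumentation on
(gdb) continue
... exercise the interesting part of the program, then interrupt with Ctrl-C ...
(gdb) monitor dump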
Thanks all for the help.
It seems it has already been answered here: How to use callgrind to profile only a certain period of program execution?
but not marked as answered for some reason.
Summary:
Starting callgrind with instrumentation off
valgrind --tool=callgrind --instr-atstart=no <PROG>
Controling instrumentation ( can be executed in other shell )
callgrind_control -i on/off

How can you trace execution of an embedded system emulated in QEMU?

I've built OpenWrt for x86 and I'm using QEMU to run it virtually. I'm trying to debug this system in real time. I need to see things like network traffic flowing, etc.
I can attach gdb remotely and execute (mostly) step by step with breakpoints. I really want tracepoints, though. I don't want to pause execution and lose network flow. When I tried setting tracepoints using tstart, I saw the message "Target does not support this command". I did a bit of reading of the gdb documentation, and from what I can tell the gdb stub that runs to intercept normal execution in QEMU does not support tracepoints.
From here I started looking at other tools and ran across PANDA (https://github.com/panda-re/panda). As I understand it, PANDA will capture a complete system trace in a log and allow for replay. I think this tool is supposed to do what I need, but I cannot seem to replay the results. I see the logs, I just can't replay them.
Now, I'm a bit stuck on what other tools/options I might have to actually trace a running embedded system. Are there any good tools you can recommend? Or perhaps another method I've missed?
If you want to see the system calls and signals, use strace.
strace can also be attached to a running process, and it can put the output in a log file if required.
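For example (a sketch; 1234 is a placeholder PID, -f follows child processes/threads, and -o writes the trace to a file):
strace -f -p 1234 -o strace.log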
In OpenWrt it is possible to build with ftrace. Ftrace has much of the functionality I required but not all.
To build with ftrace, the option for ftrace must be selected in the build menu. Additionally there are a variety of tracer options that must also be enabled.
The trace-cmd package (the ftrace front end) is located under menuconfig/Development.
Tracing support is under menuconfig/Global build settings/Compile the kernel with tracing support and includes: Trace system calls, Trace process context switches and events, and Function tracer (Function graph tracer, Enable/disable function tracing dynamically, and Function profiler)
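Once the image is built with these options, a rough usage sketch on the target looks like this (the tracer and event names are examples; run trace-cmd list to see what your kernel actually provides, and ./my_app is a placeholder):
trace-cmd record -p function_graph -e sched -e syscalls ./my_app
trace-cmd report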
I'm also planning to build a custom GDB stub to do this a little bit better, as I also want to see the data passed to the functions, not just the function calls.

Where can CUDA spend the time in a kernel call? [duplicate]

(Same question body and answer as "How to profile a specific code section in C++ code using ncu/CUPTI" above.)

How to use callgrind to profile only a certain period of program execution?

I want to use valgrind to do some profiling, since it does not require rebuilding the program (the program I want to profile is already built with "-g").
But valgrind (callgrind) is quite slow ... so here's what I want to do:
start the server ( I want to profile that server)
kind of attach to that server
before I do some operation on the server, start collecting profile data
after the operation is done, stop collecting profile data
analyze the profiling data.
I can do this kind of thing using Sun Studio on Solaris (using dbx). I just want to know: is it possible to do the same thing using valgrind (callgrind)?
Thanks
You should look at the callgrind documentation and read about callgrind_control.
Launch your app: valgrind --tool=callgrind --instr-atstart=no your_server.x
Attach to the server: not needed, since the server is launched under callgrind in the previous step.
Start collecting profile data: callgrind_control -i on
Stop collecting profile data: callgrind_control -i off
Analyze the data with kcachegrind or callgrind_annotate/cg_annotate
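For example, to look at the results on the command line (callgrind writes a callgrind.out.<pid> file in the working directory; the PID suffix varies per run):
callgrind_annotate callgrind.out.<pid>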
For profiling only some functions, you may also find CALLGRIND_START_INSTRUMENTATION and CALLGRIND_STOP_INSTRUMENTATION from the <valgrind/callgrind.h> header useful, together with callgrind's --instr-atstart=no option, as suggested in Doomsday's answer.
You don't say what OS - I'm assuming Linux - in which case you might want to look at oprofile (free) or Zoom (not free, but you can get an evaluation licence), both of which are sampling profilers and can profile existing code without re-compilation. Zoom is much nicer and easier to use (it has a GUI and some nice additional features), but you probably already have oprofile on your system.