Tracing CPU cycles with a Pintool? - profiling

I am trying to run the SPEC17 benchmarks using a Pintool. While doing so I need to log CPU cycles and the memory addresses accessed. Pin has the pinatrace tool to log memory addresses, but I cannot find anything to log CPU cycles. Is there a way to do so?
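In case it is useful, below is a minimal sketch of what such a tool could look like. It assumes Pin's standard C++ instrumentation API (pin.H) and uses the x86 time-stamp counter via __rdtsc() as a cycle proxy; it follows the same structure as pinatrace, and the output file name and format are arbitrary choices.

    // Minimal sketch: log a TSC reading alongside every memory address accessed.
    #include <cstdio>
    #include <x86intrin.h>   // __rdtsc
    #include "pin.H"

    static FILE *trace;

    // Analysis routine: called at every memory read or write.
    static VOID RecordMemAccess(VOID *addr, UINT32 isWrite)
    {
        unsigned long long tsc = __rdtsc();   // time-stamp counter as a cycle proxy
        fprintf(trace, "%llu %c %p\n", tsc, isWrite ? 'W' : 'R', addr);
    }

    // Instrumentation routine: insert a call before each memory operand.
    static VOID Instruction(INS ins, VOID *)
    {
        UINT32 memOps = INS_MemoryOperandCount(ins);
        for (UINT32 i = 0; i < memOps; i++) {
            if (INS_MemoryOperandIsRead(ins, i))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemAccess,
                                         IARG_MEMORYOP_EA, i, IARG_UINT32, 0, IARG_END);
            if (INS_MemoryOperandIsWritten(ins, i))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemAccess,
                                         IARG_MEMORYOP_EA, i, IARG_UINT32, 1, IARG_END);
        }
    }

    static VOID Fini(INT32, VOID *) { fclose(trace); }

    int main(int argc, char *argv[])
    {
        if (PIN_Init(argc, argv)) return 1;
        trace = fopen("cycletrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();   // never returns
        return 0;
    }

Note that the TSC counts reference ticks rather than unhalted core cycles, so on a machine with frequency scaling the values are only a proxy; hardware performance counters (e.g. via perf) are an alternative if true core cycles are needed.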

Related

Micro-benchmarking C++ on Linux

I am benchmarking a function of my C++ program using inline rdtsc (first and last instruction in the function). My setup has isolated cores, hyper-threading is off, and the frequency is 3.5 GHz.
I cannot afford more than 1000 CPU cycles, so I count the percentage of calls taking more than 1000 CPU cycles, and that is approximately 2-3%. The structure being accessed in the code is huge and can certainly result in cache misses, but a cache miss costs only 300-400 CPU cycles.
Is there a problem with rdtsc benchmarking? If not, what else can cause 2-3% of my calls, which go through the same set of instructions, to take an abruptly high number of cycles?
I want help understanding what I should look for to explain these 2-3% worst cases (WC).
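For reference, a common pattern for this kind of measurement on x86-64, sketched below with GCC/Clang intrinsics (work_under_test is a hypothetical stand-in for the real function), serializes around the timed region so that out-of-order execution cannot blur the start and end reads:

    #include <cstdint>
    #include <cstdio>
    #include <vector>
    #include <x86intrin.h>   // __rdtsc, __rdtscp, _mm_lfence

    // Hypothetical stand-in for the function being benchmarked: it strides
    // through a large structure so cache misses are possible, as in the question.
    static std::vector<int> big(1 << 22, 1);
    static volatile long sink;
    static void work_under_test()
    {
        long s = 0;
        for (std::size_t i = 0; i < big.size(); i += 1024) s += big[i];
        sink = s;
    }

    // Time one call in TSC ticks, fencing so that neighbouring instructions
    // cannot drift into the timed region.
    static inline std::uint64_t measure_once()
    {
        unsigned aux;
        _mm_lfence();
        std::uint64_t start = __rdtsc();
        _mm_lfence();
        work_under_test();
        std::uint64_t end = __rdtscp(&aux);   // waits for prior instructions to finish
        _mm_lfence();
        return end - start;
    }

    int main()
    {
        const int iters = 100000;
        int over_budget = 0;
        for (int i = 0; i < iters; i++)
            if (measure_once() > 1000)        // the 1000-cycle budget from the question
                ++over_budget;
        std::printf("%.2f%% of calls exceeded 1000 TSC ticks\n",
                    100.0 * over_budget / iters);
        return 0;
    }

Also keep in mind that rdtsc counts TSC ticks at the invariant base frequency, which can differ from actual core cycles when turbo or frequency scaling is active.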
Rare "performance noise" like you describe is often caused by a context switch in the timed region: your process happened to exceed its scheduler quantum during your interval and some other process was scheduled to run on the core. Another possibility is a core migration by the kernel.
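If you want to check the context-switch theory directly, one low-tech option (a sketch, not tied to your setup) is to snapshot the process's context-switch counters with getrusage around a batch of timed calls and see whether they move when the outliers appear:

    #include <cstdio>
    #include <sys/resource.h>

    // Print voluntary/involuntary context-switch counts for this process.
    // Snapshots taken before and after a batch of timed calls show whether
    // the scheduler interfered during the measurement window.
    static void print_ctx_switches(const char *label)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        std::printf("%s: voluntary=%ld involuntary=%ld\n",
                    label, ru.ru_nvcsw, ru.ru_nivcsw);
    }

    int main()
    {
        print_ctx_switches("before");
        // ... run the timed workload here ...
        print_ctx_switches("after");
        return 0;
    }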
When you say "My setup has isolated cores", are you actually pinning your process to specific cores using a tool (e.g. hwloc)? This can greatly help to get reproducible results. Have you checked for other daemon processes that might also be eligible to run on your core?
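For pinning, besides external tools like hwloc or taskset, the affinity can also be set from inside the program; a minimal Linux-only sketch (core number 3 is just an example) looks like this:

    #define _GNU_SOURCE 1
    #include <cstdio>
    #include <sched.h>   // sched_setaffinity, CPU_ZERO, CPU_SET

    int main()
    {
        // Restrict this process to a single core (core 3 here, purely as an
        // example); combine with kernel-level isolation (e.g. isolcpus) so
        // nothing else is eligible to run there.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            std::perror("sched_setaffinity");
            return 1;
        }
        // ... run the benchmark loop here ...
        return 0;
    }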
Have you tried measuring your code using a sampling profiler like gprof or HPCToolkit? These tools provide a lot more context and behavioral information that can be difficult to discover from manual instrumentation.

Callgrind / KCachegrind: why does running a program in Valgrind increase syscall time?

I've been profiling some code which likely spends a lot of its execution time in system calls. Timing some functions manually and with Callgrind, Callgrind reports system call times around 20, 30 or 40 times longer than simply timing the function (which would of course include CPU time as well).
--collect-systime=yes was used to collect this syscall time for each function.
As far as I know, Callgrind works by counting CPU instructions and, for timing system calls, simply lets the OS do the work and doesn't interfere. I don't understand why the time spent in syscalls increases when profiling with Callgrind; can anyone elaborate?
Is Callgrind still a useful tool for profiling time spent in syscalls?
Can you try --collect-systime=usec and --collect-systime=nsec to see if they are significantly different? usec might be a bit faster.
When --collect-systime is specified, Valgrind will call one of the various time syscalls for each of the client application's syscalls. I would expect that to add a substantial overhead, particularly if your client application makes very many "fast" syscalls.
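One way to see this effect in isolation (a sketch, independent of Callgrind's internals) is to time a loop of deliberately cheap syscalls natively and then again under valgrind --tool=callgrind --collect-systime=...; the difference in per-call cost is the instrumentation plus the extra time queries:

    #include <cstdio>
    #include <time.h>          // clock_gettime
    #include <unistd.h>        // syscall
    #include <sys/syscall.h>   // SYS_getpid

    int main()
    {
        const long iters = 200000;
        timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);   // a deliberately cheap syscall, issued directly
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        std::printf("%.1f ns per syscall\n", ns / iters);
        return 0;
    }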

CUDA zero-copy performance

Does anyone have experience with analyzing the performance of CUDA applications utilizing the zero-copy (reference here: Default Pinned Memory Vs Zero-Copy Memory) memory model?
I have a kernel that uses the zero-copy feature and with NVVP I see the following:
Running the kernel on an average problem size I get instruction replay overhead of 0.7%, so nothing major. And all of this 0.7% is global memory replay overhead.
When I really jack up the problem size, I get an instruction replay overhead of 95.7%, all of which is due to global memory replay overhead.
However, the global load efficiency and global store efficiency for both the normal problem size kernel run and the very very large problem size kernel run are the same. I'm not really sure what to make of this combination of metrics.
The main thing I'm not sure of is which statistics in NVVP will help me see what is going on with the zero copy feature. Any ideas of what type of statistics I should be looking at?
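For context, this is roughly how a mapped ("zero-copy") allocation is set up with the CUDA runtime API; the buffer size and variable names below are placeholders and error checking is omitted:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Allow host allocations to be mapped into the device address space;
        // this must be set before the CUDA context is created.
        cudaSetDeviceFlags(cudaDeviceMapHost);

        const size_t n = 1 << 20;   // placeholder problem size
        float *h_buf = nullptr, *d_alias = nullptr;

        // Pinned, mapped ("zero-copy") host memory: kernels dereference it over
        // PCIe instead of reading device DRAM, which is where the extra latency
        // comes from.
        cudaHostAlloc((void **)&h_buf, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);

        // d_alias is what a kernel would receive in place of a cudaMalloc'd pointer.
        std::printf("host %p is visible to the device as %p\n",
                    (void *)h_buf, (void *)d_alias);

        cudaFreeHost(h_buf);
        return 0;
    }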
Fermi and Kepler GPUs need to replay memory instructions for multiple reasons:
The memory operation was for a size specifier (vector type) that requires multiple transactions in order to perform the address divergence calculation and communicate data to/from the L1 cache.
The memory operation had thread address divergence requiring access to multiple cache lines.
The memory transaction missed the L1 cache. When the missed data is returned to L1, the L1 notifies the warp scheduler to replay the instruction.
The LSU unit's resources are full and the instruction needs to be replayed when the resources are available.
The latency to:
L2 is 200-400 cycles
device memory (DRAM) is 400-800 cycles
zero-copy memory over PCIe is 1000s of cycles
The replay overhead is increasing because of the increase in misses and the contention for LSU resources caused by the increased latency.
The global load efficiency does not change because it is the ratio of the ideal amount of data that would need to be transferred for the memory instructions that were executed to the actual amount of data transferred. Ideal means that the executed threads accessed sequential elements in memory starting at a cache-line boundary (a 32-bit operation is 1 cache line, a 64-bit operation is 2 cache lines, a 128-bit operation is 4 cache lines). For example, a fully coalesced warp of 32 threads each loading a 32-bit value touches 128 contiguous bytes, exactly one cache line, giving 100% efficiency. Accessing zero-copy memory is slower and less efficient, but it does not increase or change the amount of data transferred.
The profiler exposes the following counters:
gld_throughput
l1_cache_global_hit_rate
dram_{read, write}_throughput
l2_l1_read_hit_rate
In the zero copy case all of these metrics should be much lower.
The Nsight VSE CUDA Profiler memory experiments will show the amount of data accessed over PCIe (zero copy memory).

Using Nsight to determine bank conflicts and coalescing

How can I know the number of non-coalesced reads/writes and bank conflicts using Parallel Nsight?
Moreover, what should I look at when I use Nsight as a profiler? What are the important fields that may show why my program slows down?
I don't use Nsight, but typical fields that you'll look at with a profiler are basically:
memory consumption
time spent in functions
More specifically, with CUDA, you'll pay attention to your GPU's occupancy.
Other interesting values are how the compiler has placed your local variables: in registers or in local memory.
Finally, you'll check the time spent transferring data to and from the GPU, and compare it with the computation time.
For bank conflicts, you need to watch warp serialization. See here.
And here is a discussion about monitoring memory coalescence <-- basically you just need to watch Global Memory Loads/Stores - Coalesced/Uncoalesced and flag the Uncoalesced.
M. Tibbits basically answered what you need to know for bank conflicts and non-coalesced memory transactions.
For the question on what the important fields/things to look at are (when using the Nsight profiler) that may cause your program to slow down:
Use Application or System Trace to determine if you are CPU bound, memory bound, or kernel bound. This can be done by looking at the Timeline.
a. CPU bound – you will see large areas where no kernel or memory copy is occurring but your application threads (Thread State) are green.
b. Memory bound – kernel execution is blocked on memory transfers to or from the device. You can see this by looking at the Memory Row. If you are spending a lot of time in memory copies then you should consider using CUDA streams to pipeline your application. This can allow you to overlap memory transfers and kernels. Before changing your code you should compare the duration of the transfers and kernels and make sure you will get a performance gain.
c. Kernel bound – If the majority of the application time is spent waiting on kernels to complete then you should switch to the "Profile" activity, re-run your application, and start collecting hardware counters to see how you can make your kernel's actual execution time faster.

Parameters to watch while running an application on Linux?

I am running an application overnight on my Linux XScale device. I am looking for things which would increase with the amount of time the application has been running.
One thing is memory. If you observe the memory on the XScale systems, the free memory starts decreasing, but you will see an increase in cached memory. What are the other parameters we can observe? For example, can we observe the amount of stack or heap usage?
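For the stack/heap part of the question, one Linux option (a sketch; the exact fields vary a little by kernel version) is to periodically read the VmRSS, VmData and VmStk lines from /proc/self/status, or from /proc/<pid>/status in a separate watcher process:

    #include <fstream>
    #include <iostream>
    #include <string>

    // Print the memory-related lines of /proc/self/status: VmRSS (resident set),
    // VmData (data segment / heap) and VmStk (main-thread stack). Running this
    // periodically shows whether any of them keeps growing over a long run.
    int main()
    {
        std::ifstream status("/proc/self/status");
        std::string line;
        while (std::getline(status, line)) {
            if (line.rfind("VmRSS", 0) == 0 ||
                line.rfind("VmData", 0) == 0 ||
                line.rfind("VmStk", 0) == 0)
                std::cout << line << '\n';
        }
        return 0;
    }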
If the application is developed by you, I would recommend the following heap profiler for getting deeper details.
http://code.google.com/p/google-perftools/wiki/GooglePerformanceTools
vmstat is often a good thing to run; it gives detailed information about memory, swap, and CPU usage, as well as the average number of interrupts and context switches per second. Give it a number, n, and it will run continuously every n seconds.