Interpreting PGI_ACC_TIME output - c++

I have some OpenACC-accelerated C++ code that I've compiled using the PGI compiler. Things seem to be working, so now I want to play efficiency whack-a-mole with profiling information.
I generate some timing info by setting:
export PGI_ACC_TIME=1
And then running the program.
The following output results:
-bash-4.2$ ./a.out
libcupti.so not found
Accelerator Kernel Timing data
PGI_ACC_SYNCHRONOUS was set, disabling async() clauses
/home/myuser/myprogram.cpp
_MyProgram NVIDIA devicenum=1
time(us): 97,667
75: data region reached 2 times
75: data copyin transfers: 3
device time(us): total=101 max=82 min=9 avg=33
76: compute region reached 1000 times
76: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=680,216 max=1,043 min=654 avg=680
95: compute region reached 1000 times
95: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=487,365 max=801 min=476 avg=487
110: data region reached 2000 times
110: data copyin transfers: 1000
device time(us): total=6,783 max=140 min=3 avg=6
125: data copyout transfers: 1000
device time(us): total=7,445 max=190 min=6 avg=7
real 0m3.864s
user 0m3.499s
sys 0m0.348s
It raises some questions:
1. I see time(us): 97,667 at the top. This seems like a total time, but, at the bottom, I see real 0m3.864s. Why is there such a difference?
2. If time(us): 97,667 is the total, why is it so much smaller than values lower down, such as elapsed time(us): total=680,216?
3. The kernel reported on the line (elapsed time(us): total=680,216 max=1,043 min=654 avg=680) was run 1000 times. Are max, min, and avg values based on per-run values of the kernel?
4. Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
5. For data regions (device time(us): total=6,783) is the measurement transfer time or the entire time spent dealing with the data (preparing to transfer, post-receipt operations)?
6. The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 is a close-brace, and Line 110 is a variable definition. Should line numbers be interpreted as "the loop most closely following the indicated line number", or in some other way?
7. The kernel at 76 contains the kernel at 95. Are the times calculated for 76 inclusive of time spent in 95? If so, is there a convenient way to find the time spent in a kernel minus the times of all the subkernels?
(Some of these questions are a bit anal retentive, but I haven't found documentation for this, so I thought I'd be thorough.)

Part of the issue here is that the runtime can't find the CUDA profiling library (libcupti.so), so you're only seeing PGI's host-side profiling, not the device profiling. PGI ships libcupti.so with the compilers (under $PGI/[linux86-64|linuxpower]/2017/cuda/[7.5|8.0]/lib64), but it is an optional install, so it may not be installed on the system you're running on. CUPTI also ships with the CUDA SDK, so if the system has CUDA installed, you can try pointing your LD_LIBRARY_PATH there instead. On my system it's installed in "/opt/cuda-8.0/extras/CUPTI/lib64/".
The missing CUPTI library is why you're seeing the bad time, 97,667, for the file time. Also since you're missing CUPTI, the time you're seeing is being measured from the host. With CUPTI, in addition to the elapsed time, you'd see the device time for each kernel. The difference between the elapsed time and the device time is the launch overhead per kernel.
3. Are max, min, and avg values based on per-run values of the kernel?
Yes.
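As a quick sanity check on that "Yes", the avg field should equal total divided by the number of launches. The sketch below assumes the line format shown in the question's output; the helper name `field` is made up for illustration and is not part of any PGI tooling.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: return the integer that follows "key=" in a
// PGI_ACC_TIME line, skipping thousands separators. Returns -1 if the
// key is not present.
long long field(const std::string& line, const std::string& key) {
    const std::string needle = key + "=";
    std::size_t pos = line.find(needle);
    if (pos == std::string::npos) return -1;
    pos += needle.size();
    long long value = 0;
    for (; pos < line.size(); ++pos) {
        const unsigned char c = static_cast<unsigned char>(line[pos]);
        if (std::isdigit(c)) {
            value = value * 10 + (c - '0');
        } else if (c != ',') {
            break;  // end of the number
        }
    }
    return value;
}

// Average time per launch, using integer division like the report does.
long long avg_per_launch(long long total_us, long long launches) {
    return launches > 0 ? total_us / launches : 0;
}
```

For the kernel at line 76, `field(line, "total")` divided by the 1000 launches gives 680, which matches the reported avg=680.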
4. Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
I tend to look at the avg time first, since there are typically more opportunities to tune these loops. If you are varying the amount of work per kernel iteration (i.e., the grid size changes), then it might not be as useful, but it is a good starting point.
Now if you had a low average but many calls, then the elapsed time may be dominated by kernel launch overhead. In which case, I'd look to see if you can combine loops or push more work into each loop.
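To illustrate what "combine loops" could look like, here is a minimal sketch with two made-up functions (not the asker's code). The first launches two kernels per call, the second does the same arithmetic in one. The data clauses are an assumption; in real code an enclosing data region would normally handle the transfers.

```cpp
// Two separate compute regions: two kernel launches per call.
void scale_then_shift_two_kernels(float* a, int n, float s, float b) {
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] *= s;
    }
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] += b;
    }
}

// Same arithmetic in one compute region: one kernel launch per call,
// so the fixed launch overhead is paid once instead of twice.
void scale_then_shift_fused(float* a, int n, float s, float b) {
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] = a[i] * s + b;
    }
}
```

A compiler without OpenACC support ignores the pragmas and runs both versions on the host, so the results can be compared there first. Whether fusing pays off on the device depends on how much of the elapsed time is launch overhead, which is what the device time from CUPTI would show.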
5. For data regions (device time(us): total=6,783) is the measurement transfer time or the entire time spent dealing with the data (preparing to transfer, post-receipt operations)?
Just the data transfer time. For the overhead, you would need to use PGPROF/NVPROF.
6. The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 is a close-brace, and Line 110 is a variable definition. Should line numbers be interpreted as "the loop most closely following the indicated line number", or in some other way?
It's because the code has been optimized, so the line number may be a bit off, though it should correspond to the line numbers in the compiler feedback messages (-Minfo=accel). So the "the loop most closely..." interpretation should be correct.

Related

Measuring FLOPS and memory traffic in a C++ program

I am trying to profile a C++ program. For the first step, I want to determine whether the program is compute-bound or memory-bound by the Roofline Model. So I need to measure the following 4 things.
W: # of computations performed in the program (FLOPs)
Q: # of bytes of memory accesses incurred in the program (Bytes)
π: peak performance (FLOP/s)
β: peak bandwidth (Byte/s)
I have tried to use Linux perf to measure W. I followed the instructions here, using libpfm4 to determine the available events (by ./showevinfo). I found my CPU supports the INST_RETIRED event with umask X87, then I used ./check_events INST_RETIRED:X87 to find the code, which is 0x5302c0. Then I tried perf stat -e r5302c0 ./test_exe and I got
Performance counter stats for './test_exe':
83,381,997 r5302c0
20.134717382 seconds time elapsed
74.691675000 seconds user
0.357003000 seconds sys
Questions:
Is this the right way to measure the W of my program? If yes, then it should be 83,381,997 FLOPs, right?
Why is this FLOP count not stable between repeated executions?
How can I measure the others: Q, π and β?
Thanks for your time and any suggestions.
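For reference, once the four quantities are known, the Roofline classification is a small calculation: the arithmetic intensity is I = W / Q, and the attainable performance is min(π, β · I). A minimal sketch follows; the struct and function names are made up for illustration, and the numbers in the usage note are not measurements.

```cpp
#include <algorithm>

// Result of placing one program on the Roofline plot.
struct Roofline {
    double intensity;    // I = W / Q, in FLOPs per byte
    double attainable;   // min(pi, beta * I), in FLOP/s
    bool memory_bound;   // true when the bandwidth roof is the lower one
};

// W: FLOPs performed, Q: bytes moved,
// pi: peak FLOP/s, beta: peak bytes/s.
Roofline roofline(double W, double Q, double pi, double beta) {
    const double intensity = W / Q;
    const double bandwidth_roof = beta * intensity;
    return {intensity, std::min(pi, bandwidth_roof), bandwidth_roof < pi};
}
```

With hypothetical values W = 1e9 FLOPs, Q = 4e9 bytes, π = 100e9 FLOP/s and β = 20e9 bytes/s, the intensity is 0.25 FLOPs/byte and the bandwidth roof (5e9 FLOP/s) is below π, so such a program would be memory-bound.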

Reasons for variations in run-time performance of identical code

When running some benchmarks of my C++ software, I obtained the following picture:
The plot shows the execution time in nanoseconds of one tick of the software.
The exact same tick is run each time (one data point is one tick).
When testing in the simulated environment of valgrind, there is zero difference between ticks, and I don't make any syscalls other than whatever clock_gettime may do.
I would like to understand what can cause the two "speeds" at which the tick seems to run, and what could cause the outlier points. Disabling the Intel CPU sleep states helped greatly (before that I had 4 lines like this). The scheduler used is Linux's FIFO scheduler.
An interesting observation is that the tick times alternate between the two values of 9000 and 6700 ns; here's some data points:
9022
6605
9170
6756
9126
6594
9102
6744
9016
6643
8950
6638
9047
6662
Edit:
Just doing this in a loop in my thread:
auto t0 = std::chrono::high_resolution_clock::now();
auto t1 = std::chrono::high_resolution_clock::now();
measure(std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
is enough for me to see the alternating effect, though between 100 and 150 ns.
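A self-contained version of that experiment might look like the sketch below. The question's `measure` is not shown, so here the samples are simply collected into a vector; the function name is made up for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Time n pairs of back-to-back clock reads and return each gap in
// nanoseconds. No work happens between the two reads, so every sample
// is the cost of the clock call itself plus any system noise.
std::vector<long long> sample_clock_overhead(std::size_t n) {
    using clock = std::chrono::high_resolution_clock;
    std::vector<long long> samples;
    samples.reserve(n);  // avoid reallocations inside the timed loop
    for (std::size_t i = 0; i < n; ++i) {
        const auto t0 = clock::now();
        const auto t1 = clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
    return samples;
}
```

Printing the samples from `sample_clock_overhead(1000)` in order shows whether the alternation is already there with an empty tick, which would suggest looking at the timing path or the platform before the tick's own code.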

RAM consumption regarding cores

I am working on a 12-core Linux (Kubuntu) computer with 31 GB of available RAM.
I produce simulations which calculate functions over 4 dimensions (x,y,z,t).
I define my dimensions as arrays that I pass to numpy.meshgrid. So, for each point in time, I calculate the result for each point x,y,z. This means heavy calculations on heavy data.
First, I learned how to do this with only one core. It works well, whatever the size of my "boxes" (x,y,z). Because I work a lot with Fourier transforms, I define x,y,z,t as powers of 2: 64, 128, 256, ...
I can, without difficulty, go up to x = y = z = t = 512, even if it takes a lot of time to run (which makes sense). When I do that, I use around 20-30% of the available RAM of the computer. Great.
Then I wanted to use more cores. So I implemented this code :
import multiprocessing as mp
pool = mp.Pool(processes=8)
results = [pool.apply_async(conv_green, args=(tstep, S_, )) for tstep in t]
So here I ask my script to use 8 cores, and define my results as the output of the function "conv_green" called with the args "tstep, S_" for every tstep in t.
It works pretty well and uses 8 cores as expected, BUT I can no longer run simulations that use sizes of 512 or above for x,y,z,t.
This is where my problem is. Technically, switching from single-core to multi-core changed nothing in the routine of my calculations. I do not understand why I have enough RAM for 512 on a single core and why, suddenly, when I switch to multiple cores, the computer does not even want to launch it (the error occurs at the "results = pool.apply ..." line).
So if you know how this works and why I hit this threshold, thanks for helping me sort it out!
Best regards.
PS: this is the error that appears when it crashes with 512 on multiple cores:
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "/home/alexis/Heat/Simu⁄Lecture Propre/Test Tkinter/Simulation N spots SCAN Tkinter.py", line 280, in
XYslice = array([p.get()[0] for p in results])
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
SystemError: NULL result without error in PyObject_Call
For multiprocessing in any language each thread will need private storage which it can write to without interference from the other threads. As soon as interference is possible the data structure has to be locked, which (in the worst case) takes us back to single threading.
It would appear that your large data structure is being copied for each of the threads, effectively multiplying your memory usage by eight when you have eight processors ... or up to 200% of your available RAM.
The best solution would be to prevent the unnecessary copying.
If that's not feasible then all you can do is limit the number of processors it can run on, four should be ok in your instance but make sure your machine has lots of swap space. The swap space also gives you some play to allow the virtual memory to exceed the physical RAM, if the "working set" is small enough you may be able to significantly exceed your physical RAM given enough swap.

DISK_PERFORMANCE struct's ReadTime and WriteTime members

I'm trying to dissect the DISK_PERFORMANCE struct but can't seem to find any decent documentation. Does anyone know what the ReadTime and WriteTime members mean?
The MSDN claims, "The time it takes to complete a read/write", but the read/write of what? Also what is it measured in?
Update: I didn't know, but I do now.
I wasn't familiar with DISK_PERFORMANCE but I am familiar with the HKEY_PERFORMANCE_DATA performance data.
The Avg. Disk sec/Read counter reports the average time per read (and there's another counter for writes). This counter has the PERF_AVERAGE_TIMER type. The data that you actually get is the total time spent reading and the total number of operations. You acquire two samples and subtract the values to get the total time spent reading during the sample interval and the total number of operations during the sample interval. You then divide these two values to get the average time per read.
The clock frequency is also returned along with the performance data so you can convert the time units to seconds.
Assuming that DISK_PERFORMANCE works similarly then ReadTime and WriteTime will be the total time spent on all reads and writes. Unfortunately, it's not obvious what clock frequency it's using, but it's most likely using the value from QueryPerformanceFrequency. I'd try that and see if the results (for average read and write time) compare to the values you see in perfmon.
The header file (winioctl.h) doesn't contain much useful information, but it does say that the IOCTL_DISK_PERFORMANCE request is forwarded to either the DISKPERF filter driver or the SIMBAD filter driver (which simulates disk faults). Which means you should get consistent results across different device types.
Update
So I did the research. Some sample data:
3579000, 42, 801881, 42, 4.46325577
3749000, 79, 839970, 79, 4.46325464
4076000, 66, 913235, 66, 4.463254255
3614000, 77, 809723, 77, 4.463254718
1465000, 28, 328236, 28, 4.46325205
Each line has the deltas of the ReadTime and ReadCount members from DISK_PERFORMANCE (sampled once per second) followed by the corresponding values from HKEY_PERFORMANCE_DATA, followed by the first ReadTime divided by the second.
The HKEY_PERFORMANCE_DATA values are in QueryPerformanceFrequency units, 2240517Hz on my PC. 10,000,000 / 2240517 = 4.4633 so the DISK_PERFORMANCE metrics seem to be in 100ns (=10MHz) units.
To reiterate, DISK_PERFORMANCE::ReadTime is the total time spent on reads in 100ns units.
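Given that conclusion, turning two samples into an average read latency is plain arithmetic. The sketch below takes the deltas as ordinary integers (in the real struct ReadTime is a LARGE_INTEGER) and assumes the 100ns unit derived above; the function name is made up for illustration.

```cpp
// Average time per read in milliseconds over one sampling interval.
// read_time_delta:  difference of two ReadTime samples, in 100 ns units
// read_count_delta: difference of two ReadCount samples
double avg_read_ms(long long read_time_delta, unsigned long read_count_delta) {
    if (read_count_delta == 0) return 0.0;  // no reads in this interval
    const double total_seconds = static_cast<double>(read_time_delta) * 100e-9;
    return total_seconds * 1000.0 / static_cast<double>(read_count_delta);
}
```

For the first sample row above (3579000 time units over 42 reads) this comes to roughly 8.5 ms per read.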
In general, as with all DeviceIoControl requests, it means whatever the underlying driver means. As you can deduce from the StorageManagerName member, there are multiple drivers that use this struct.

How to get total cpu usage in Linux using C++

I am trying to get total CPU usage in %. First I should say that "top" will simply not do: there is a delay between CPU dumps, and it requires 2 dumps and several seconds, which hangs my program (I do not want to give it its own thread).
The next thing I tried is "ps", which is instant but always gives a very high total (20+), and when I actually got my CPU to do something it stayed at about 20...
Is there any other way that I could get total CPU usage? It does not matter if it is over one second or longer periods of time... Longer periods would be more useful, though.
cat /proc/stat
http://www.linuxhowtos.org/System/procstat.htm
I agree with this answer above. The cpu line in this file gives the total number of "jiffies" your system has spent doing different types of processing.
What you need to do is take 2 readings of this file, separated by whatever interval of time you require. The numbers are increasing values (subject to integer rollover), so to get the %cpu you need to calculate how many jiffies have elapsed over your interval versus how many jiffies were spent doing work.
e.g.
Suppose at 14:00:00 you have
cpu 4698 591 262 8953 916 449 531
total_jiffies_1 = (sum of all values) = 16400
work_jiffies_1 = (sum of user,nice,system = the first 3 values) = 5551
and at 14:00:05 you have
cpu 4739 591 289 9961 936 449 541
total_jiffies_2 = 17506
work_jiffies_2 = 5619
So the %cpu usage over this period is:
work_over_period = work_jiffies_2 - work_jiffies_1 = 68
total_over_period = total_jiffies_2 - total_jiffies_1 = 1106
%cpu = work_over_period / total_over_period * 100 = 6.1%
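The same calculation can be sketched in C++ as follows. It uses the definition above (work = user + nice + system, total = sum of every field on the line); the struct and function names are made up for illustration.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Jiffy counters taken from the aggregate "cpu" line of /proc/stat.
struct CpuSample {
    unsigned long long work = 0;   // user + nice + system
    unsigned long long total = 0;  // sum of all fields on the line
};

// Parse a line such as "cpu 4698 591 262 8953 916 449 531".
CpuSample parse_cpu_line(const std::string& line) {
    std::istringstream in(line);
    std::string label;
    in >> label;  // skip the leading "cpu"
    CpuSample sample;
    unsigned long long value = 0;
    for (int index = 0; in >> value; ++index) {
        if (index < 3) sample.work += value;  // first three fields are work
        sample.total += value;
    }
    return sample;
}

// Read the first line of /proc/stat, which is the aggregate over all CPUs.
CpuSample read_proc_stat() {
    std::ifstream file("/proc/stat");
    std::string line;
    std::getline(file, line);
    return parse_cpu_line(line);
}

// Percentage of jiffies spent working between two samples.
double cpu_percent(const CpuSample& before, const CpuSample& after) {
    const unsigned long long total = after.total - before.total;
    if (total == 0) return 0.0;  // no jiffies elapsed between the samples
    return 100.0 * static_cast<double>(after.work - before.work)
                 / static_cast<double>(total);
}
```

Feeding the two example lines through `parse_cpu_line` and `cpu_percent` reproduces the 6.1% figure. In a real program, call `read_proc_stat()` twice with the desired interval in between.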
Try reading /proc/loadavg. The first three numbers are the number of processes actually running (i.e., using a CPU), averaged over the last 1, 5, and 15 minutes, respectively.
http://www.linuxinsight.com/proc_loadavg.html
Read /proc/cpuinfo to find the number of CPU/cores available to the systems.
Call the getloadavg() (or alternatively read the /proc/loadavg), take the first value, multiply it by 100 (to convert to percents), divide by number of CPU/cores. If the value is greater than 100, truncate it to 100. Done.
Relevant documentation: man getloadavg and man 5 proc
N.B. Load average, common on *NIX systems, can be more than 100% (per CPU/core) because it actually measures the number of processes ready to be run by the scheduler. With a Windows-like CPU metric, when load is at 100% you do not really know whether it is optimal use of CPU resources or whether the system is overloaded. Under *NIX, optimal use of the CPU gives a loadavg of about 1.0 (or 2.0 for a dual-CPU system). If the value is much greater than the number of CPUs/cores, then you might want to plug extra CPUs into the box.
Otherwise, dig the /proc file system.
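The load-average recipe above can be sketched as follows. One substitution to note: instead of parsing /proc/cpuinfo, this sketch uses std::thread::hardware_concurrency() for the core count, which may return 0 on some platforms. The function names are made up for illustration.

```cpp
#include <algorithm>
#include <cstdlib>   // getloadavg() on glibc and the BSDs
#include <thread>

// Convert a 1-minute load average to a percentage of total capacity,
// truncated to 100 as the recipe describes.
double load_to_percent(double load1, unsigned ncpu) {
    if (ncpu == 0) ncpu = 1;  // core count unknown: assume one core
    return std::min(100.0, 100.0 * load1 / ncpu);
}

// Query the current 1-minute load average; returns -1.0 on failure.
double current_load_percent() {
    double loads[1];
    if (getloadavg(loads, 1) != 1) return -1.0;
    return load_to_percent(loads[0], std::thread::hardware_concurrency());
}
```

Keep in mind that this measures runnable processes rather than CPU time, so it can differ noticeably from the /proc/stat figure.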
cpu-stat is a C++ project that reads the Linux CPU counters from /proc/stat.
Get the CPUData.* and CPUSnapshot.* files from cpu-stat's src directory.
Quick implementation to get overall cpu usage:
#include "CPUSnapshot.h"
#include <chrono>
#include <thread>
#include <iostream>
int main()
{
    CPUSnapshot previousSnap;
    std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    CPUSnapshot curSnap;
    const float ACTIVE_TIME = curSnap.GetActiveTimeTotal() - previousSnap.GetActiveTimeTotal();
    const float IDLE_TIME = curSnap.GetIdleTimeTotal() - previousSnap.GetIdleTimeTotal();
    const float TOTAL_TIME = ACTIVE_TIME + IDLE_TIME;
    int usage = 100.f * ACTIVE_TIME / TOTAL_TIME;
    std::cout << "total cpu usage: " << usage << " %" << std::endl;
}
Compile it:
g++ -std=c++11 -o CPUUsage main.cpp CPUSnapshot.cpp CPUData.cpp
I suggest starting with two files:
/proc/stat and /proc/cpuinfo.
http://www.mjmwired.net/kernel/Documentation/filesystems/proc.txt
Have a look at this C++ library.
The information is parsed from /proc/stat. It also parses memory usage from /proc/meminfo and Ethernet load from /proc/net/dev.
----------------------------------------------
current CPULoad:5.09119
average CPULoad 10.0671
Max CPULoad 10.0822
Min CPULoad 1.74111
CPU: : Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
----------------------------------------------
network load: wlp0s20f3 : 1.9kBit/s : 920Bit/s : 1.0kBit/s : RX Bytes Startup: 15.8mByte TX Bytes Startup: 833.5mByte
----------------------------------------------
memory load: 28.4% maxmemory: 16133792 Kb used: 4581564 Kb Memload of this Process 170408 KB
----------------------------------------------