Measuring FLOPS and memory traffic in a C++ program - c++

I am trying to profile a C++ program. For the first step, I want to determine whether the program is compute-bound or memory-bound by the Roofline Model. So I need to measure the following 4 things.
W: # of computations performed in the program (FLOPs)
Q: # of bytes of memory accesses incurred in the program (Byte/s)
π: peak performance (FLOPs)
β: peak bandwidth (Byte/s)
I have tried to use Linux perf to measure W. I followed the instructions here, using libpfm4 to determine the available events (by ./showevinfo). I found my CPU supports the INST_RETIREDevent with umask X87, then I used ./check_events INST_RETIRED:X87 to find the code, which is 0x5302c0. Then I tried perf stat -e r5302c0 ./test_exe and I got
Performance counter stats for './test_exe':
83,381,997 r5302c0
20.134717382 seconds time elapsed
74.691675000 seconds user
0.357003000 seconds sys
Questions:
Is it right for my process to measure the W of my program? If yes, then it should be 83,381,997 FLOPs, right?
Why is this FLOPs not stable between repeated executions?
How can I measure the other Q, π and β?
Thanks for your time and any suggestions.

Related

Reasons for variations in run-time performance of identical code

when running some benchmarks of my C++ software, I obtained the following picture:
The plot shows the execution time in nanoseconds of one tick of the software.
The exact same tick is ran each time (one data point is one tick).
When testing in the simulated environment of valgrind, there is zero difference between each tick, and I don't have syscalls others than what clock_gettime may do.
I would like to understand what can cause the two "speeds" in which the tick seems to run. I disabled intel CPU sleep states which greatly helped (before that I had 4 lines like this), and what could be the causes for the outlier points. The scheduler used is linux's FIFO scheduler.
An interesting observation is that the tick times alternate between the two values of 9000 and 6700 ns; here's some data points:
9022
6605
9170
6756
9126
6594
9102
6744
9016
6643
8950
6638
9047
6662
edit:
just doing this in a loop in my thread:
auto t0 = std::chrono::high_resolution_clock::now();
auto t1 = std::chrono::high_resolution_clock::now();
measure(std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()));
is enough for me to see the alternating effect, though between 100 and 150 ns.

Interpreting PGI_ACC_TIME output

I have some OpenACC-accelerated C++ code that I've compiled using the PGI compiler. Things seem to be working, so now I want to play efficiency whack-a-mole with profiling information.
I generate some timing info by setting:
export PGI_ACC_TIME=1
And then running the program.
The following output results:
-bash-4.2$ ./a.out
libcupti.so not found
Accelerator Kernel Timing data
PGI_ACC_SYNCHRONOUS was set, disabling async() clauses
/home/myuser/myprogram.cpp
_MyProgram NVIDIA devicenum=1
time(us): 97,667
75: data region reached 2 times
75: data copyin transfers: 3
device time(us): total=101 max=82 min=9 avg=33
76: compute region reached 1000 times
76: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=680,216 max=1,043 min=654 avg=680
95: compute region reached 1000 times
95: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=487,365 max=801 min=476 avg=487
110: data region reached 2000 times
110: data copyin transfers: 1000
device time(us): total=6,783 max=140 min=3 avg=6
125: data copyout transfers: 1000
device time(us): total=7,445 max=190 min=6 avg=7
real 0m3.864s
user 0m3.499s
sys 0m0.348s
It raises some questions:
I see time(us): 97,667 at the top. This seems like a total time, but, at the bottom, I see real 0m3.864s. Why is there such a difference?
If time(us): 97,667 is the total, why is it so much smaller than values lower down, such as elapsed time(us): total=680,216?
This kernel including the line (elapsed time(us): total=680,216 max=1,043 min=654 avg=680) was run 1000 times. Are max, min, and avg values based on per-run values of the kernel?
Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
For data regions (device time(us): total=6,783) is the measurement transfer time or the entire time spent dealing with the data (preparing to transfer, post-receipt operations)?
The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 in is a close-brace, and Line 110 is a variable definition. Should line numbers be interpreted as "the loop most closely following the indicated line number", or in some other way?
The kernel at 76 contains the kernel at 95. Are the times calculated for 76 inclusive of time spent in 95? If so, is there a convenient way to find the time spent in a kernel minus the times of all the subkernels?
(Some of these questions are a bit anal retentive, but I haven't found documentation for this, so I thought I'd be thorough.)
Part of the issue here is that the runtime can't find the CUDA Profiling library (libcupti.so), hence you're only seeing the PGI CPU side profiling not the device profiling. PGI ships libcupti.so library with the compilers (under $PGI/[linux86-64|linuxpower]/2017/cuda/[7.5|8.0]/lib64) but this is an optional install so you may not have it install on the system you're running. CUPTI also ships with the CUDA SDK, so if the system has CUDA install, you can try setting you're LD_LIBRARY_PATH there instead. On my system it's installed in "/opt/cuda-8.0/extras/CUPTI/lib64/".
The missing CUPTI library is why you're seeing the bad time, 97,667, for the file time. Also since you're missing CUPTI, the time you're seeing is being measured from the host. With CUPTI, in addition to the elapsed time, you'd see the device time for each kernel. The difference between the elapsed time and the device time is the launch overhead per kernel.
Are max, min, and avg values based on per-run values of the kernel?
Yes.
4.Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
I tend to first look at the avg time since there's typically more opportunities to tune these loops. If you are varying the amount of work per kernel iteration (i.e the grid size changes), then it might not be as useful, but a good starting point.
Now if you had a low average but many calls, then the elapsed time may be dominated by kernel launch overhead. In which case, I'd look to see if you can combine loops or push more work into each loop.
5.For data regions (device time(us): total=6,783) is the measurement transfer time or the entire time spent dealing with the data
(preparing to transfer, post-receipt operations)?
Just the data transfer time. For the overhead, you would need to use PGPROF/NVPROF.
6.The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 in is a close-brace, and Line 110 is a
variable definition. Should line numbers be interpreted as "the loop
most closely following the indicated line number", or in some other
way?
It's because the code's been optimized so the line number may be a bit off though it should correspond to the line numbers from compiler feedback messages (-Minfo=accel). So "the loop most closely..." option should be correct.

Poor man's Python data parallelism doesn't scale with the # of processors? Why?

I'm needing to make many computational iterations, say 100 or so, with a very complicated function that accepts a number of input parameters. Though the parameters will be varied, each iteration will take nearly the same amount of time to compute. I found this post by Muhammad Alkarouri. I modified his code so as to fork the script N times where I would plan to set N equal at least to the number of cores on my desktop computer.
It's an old MacPro 1,0 running 32-bit Ubuntu 16.04 with 18GB of RAM, and in none of the runs of the test file below does the RAM usage exceed about 15% (swap is never used). That's according to the System Monitor's Resources tab, on which it is also shown that when I try running 4 iterations in parallel all four CPUs are running at 100%, whereas if I only do one iteration only one CPU is 100% utilized and the other 3 are idling. (Here are the actual specs for the computer. And cat /proc/self/status shows a Cpus_allowed of 8, twice the total number of cpu cores, which may indicate hyper-threading.)
So I expected to find that the 4 simultaneous runs would consume only a little more time than one (and going forward I generally expected run times to pretty much scale in inverse proportion to the number of cores on any computer). However, I found to the contrary that instead of the running time for 4 being not much more than the running time for one, it is instead a bit more than twice the running time for one. For example, with the scheme presented below, a sample "time(1) real" value for a single iteration is 0m7.695s whereas for 4 "simultaneous" iterations it is 0m17.733s. And when I go from running 4 iterations at once to 8, the running time scales in proportion.
So my question is, why doesn't it scale as I had supposed (and can anything be done to fix that)? This is, by the way, for deployment on a few desktops; it doesn't have to scale or run on Windows.
Also, I have for now neglected the multiprocessing.Pool() alternative as my function was rejected as not being pickleable.
Here is the modified Alkarouri script, multifork.py:
#!/usr/bin/env python
import os, cPickle, time
import numpy as np
np.seterr(all='raise')
def function(x):
print '... the parameter is ', x
arr = np.zeros(5000).reshape(-1,1)
for r in range(3):
for i in range(200):
arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
return { 'junk': arr, 'shape': arr.shape }
def run_in_separate_process(func, *args, **kwds):
numruns = 4
result = [ None ] * numruns
pread = [ None ] * numruns
pwrite = [ None ] * numruns
pid = [ None ] * numruns
for i in range(numruns):
pread[i], pwrite[i] = os.pipe()
pid[i] = os.fork()
if pid[i] > 0:
pass
else:
os.close(pread[i])
result[i] = func(*args, **kwds)
with os.fdopen(pwrite[i], 'wb') as f:
cPickle.dump((0,result[i]), f, cPickle.HIGHEST_PROTOCOL)
os._exit(0)
#time.sleep(17)
while True:
for i in range(numruns):
os.close(pwrite[i])
with os.fdopen(pread[i], 'rb') as f:
stat, res = cPickle.load(f)
result[i] = res
#os.waitpid(pid[i], 0)
if not None in result: break
return result
def main():
print 'Running multifork.py.'
print run_in_separate_process( function, 3 )
if __name__ == "__main__":
main()
With multifork.py, uncommenting os.waitpid(pid[i], 0) has no effect. And neither does uncommenting time.sleep() if 4 iterations are being run at once and if the delay is not set to more than about 17 seconds. Given that time(1) real is something like 0m17.733s for 4 iterations done at once, I take that to be an indication that the While True loop is not itself a cause of any appreciable inefficiency (due to the processes all taking the same amount of time) and that the 17 seconds are indeed being consumed solely by the child processes.
Out of a profound sense of mercy I have spared you for now my other scheme, with which I employed subprocess.Popen() in lieu of os.fork(). With that one I had to send the function to an auxiliary script, the script that defines the command that is the first argument of Popen(), via a file. I did however use the same While True loop. And the results? They were the same as with the simpler scheme that I am presenting here--- almost exactly.
Why don't you use joblib.Parallel feature?
#!/usr/bin/env python
from joblib import Parallel, delayed
import numpy as np
np.seterr(all='raise')
def function(x):
print '... the parameter is ', x
arr = np.zeros(5000).reshape(-1,1)
for r in range(3):
for i in range(200):
arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
return { 'junk': arr, 'shape': arr.shape }
def main():
print 'Running multifork.py.'
print Parallel(n_jobs=2)(delayed(function)(3) for _ in xrange(4))
if __name__ == "__main__":
main()
It seems that you have some bottleneck in your computations.
On your example you passes your data over the pipe which is not very fast method. To avoid this performance problem you should use shared memory. This is how multiprocessing and joblib.Parallel work.
Also you should remember that in single-thread case you don't have to serialize and deserialize the data but in case of multiprocess you have to.
Next, even if you have 8 cores it could be hyper-threading feature enabled with divides core performance into 2 threads, thus if you have 4 HW cores the OS things that there are 8 of them. There are a lot of advantages and disadvantages to use HT but the main thing is if you going to load all your cores to do some computations for a long time then you should disable it.
For example, I have a Intel(R) Core(TM) i3-2100 CPU # 3.10GHz with 2 HW cores and HT enabled. So in top I saw 4 cores. The time of computations for me are:
n_jobs=1 - 0:07.13
n_jobs=2 - 0:06.54
n_jobs=4 - 0:07.45
This is how my lscpu looks like:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Stepping: 7
CPU MHz: 1600.000
BogoMIPS: 6186.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Note at Thread(s) per core line.
So in your example there are not so many computations rather that data transfering. You application doesn't have time to get the advantages of parallelism. If you will have a long computation job (about 10 minutes) I think you will get it.
ADDITION:
I've taken a look at your function more detailed. I've replaced multiple execution with just one execution of function(3) function and run it within profiler:
$ /usr/bin/time -v python -m cProfile so.py
Output if quite long, you can view full version here (http://pastebin.com/qLBBH5zU). But the main thing is that the program lives most of the time in numpy.concatenate function. You can see it:
ncalls tottime percall cumtime percall filename:lineno(function)
.........
600 1.375 0.002 1.375 0.002 {numpy.core.multiarray.concatenate}
.........
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.64
.........
If you will run multiple instances of this program you will see that the time increases very much as the execution time of individual program instance. I've started 2 copy simultaneously:
$ /usr/bin/time -v python -m cProfile prog.py & /usr/bin/time -v python -m cProfile prog.py &
On the other hand I wrote a small fibo function:
def fibo(x):
arr = (0, 1)
for _ in xrange(x):
arr = (arr[-1], sum(arr))
return arr[0]
And replace the concatinate line with fibo(10000). In this case the execution time of single-instanced program is 0:22.82 while the execution time of two instances takes almost the same time per instance (0:24.62).
Based on this I think perhaps numpy uses some shared resource which leads to parallization problem. Or it can be numpy or scipy -specific issue.
And the last thing about the code, you have to replace the block below:
for r in range(3):
for i in range(200):
arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
With the only line:
arr = np.concatenate( (arr, np.log(np.arange(2, 5002).repeat(3*200).reshape(-1,3*200))), axis=1 )
I am providing an answer here so as to not leave such a muddled record. First, I can report that most-serene frist's joblib code works, and note that it's shorter and that it also doesn't suffer from the limitations of the While True loop of my example, which only works efficiently with jobs that each take about the same amount of time. I see that the joblib project has current support and if you don't mind dependency upon a third-party library it may be an excellent solution. I may adopt it.
But, with my test function the time(1) real, the running time using the time wrapper, is about the same with either joblib or my poor man's code.
To answer the question of why the scaling of the running time in inverse proportion to the number of physical cores doesn't go as hoped, I had diligently prepared my test code that I presented here so as to produce similar output and to take about as long to run as the code of my actual project, per single-run without parallelism (and my actual project is likewise CPU-bound, not I/O-bound). I did so in order to test before undertaking the not-really-easy fitting of the simple code into my rather complicated project. But I have to report that in spite of that similarity, the results were much better with my actual project. I was surprised to see that I did get the sought inverse scaling of running time with the number of physical cores, more or less.
So I'm left supposing--- here's my tentative answer to my own question--- that perhaps the OS scheduler is fickle and very sensitive to the type of job. And there could be effects due to other processes that may be running even if, as in my case, the other processes are hardly using any CPU time (I checked; they weren't).
Tip #1: Never name your joblib test code joblib.py (you'll get an import error). Tip #2: Never rename your joblib.py file test code file and run the renamed file without deleting the joblib.py file (you'll get an import error).

Best option to profile CPU use in my program?

I am profiling CPU usage on a simple program I am writing. I have different algorithms I want to try, and I also want to know what's the impact on the total system performance.
Currently, I am using ualarm() to execute some instructions at 30Hz; every 15 of those interruptions (every 0.5s) I record the CPU time with getrusage() (in useconds), so I have an estimation on the total cpu time of cpu consumption on that point in time. But to get context, I also need to know the total time elapsed in the system in that time period, so I can have the % of which is used by my program.
/* Main Loop */
while(1)
{
alarm = 0;
/* Waiting Loop: */
for(i=0; !alarm; i++){
}
count++;
/* Do my things */
/* Check if it's time to store cpu log: */
if ((count%count_max) == 0)
{
getrusage(RUSAGE_SELF, &ru);
store_cpulog(f,
(int64_t) ru.ru_utime.tv_sec,
(int64_t) ru.ru_utime.tv_usec,
(int64_t) ru.ru_stime.tv_sec,
(int64_t) ru.ru_stime.tv_usec);
}
}
I have different options, but I don't know which one will provide the most exact result:
Use ualarm for the timing. Currently it's programmed to signal every 0.5 seconds, so I can take those 0.5 seconds as the CPU time. Seems quite obvious to use, but it's the best option?
Use clock_gettime(CLOCK_MONOTONIC): it provides readings with a nanosec resolution.
Use gettimeofday(): provides readings with a usec resolution. I've found opinions against using it.
Any recommendation? Thanks.
Possible solution is to use system function time and don't using busy loop (like #Hasturkun say) in your program. Call in console:
time /path/to/my/program
and after execution of it you get something like:
real 0m1.465s
user 0m0.000s
sys 0m1.210s
Not sure about precision, if it is enough for you.
Callgrind is possibly the best application for profiling C/C++ code under linux. Use it with pride:)

How can I profile the TLB hits and TLB misses in ubuntu

I have written a simple C++ program using for-loop to print the numbers from 1 to 100. I want to find the number of TLB hits and misses occurring for the particular program while running. Is there any possibility to get this data?
I am using Ubuntu. I have used perf tool. But it is producing different result in different times. I am very confused what part of my code is leading to such a huge number TLB hits, TLB misses and cache misses.
Ofcourse there might be other processes running simultaneously like Ubuntu GUI. But, does this result includes those process too?
command I used: perf stat -e dTLB-loads -e dTPerformance counter stats for './hellocc':
result: first time--
909,822 dTLB-loads
2,023 dTLB-misses # 0.22% of all dTLB cache hits
4,512 cache-misses
0.006821182 seconds time elapsed
LB-misses ./hellocc
result: Second time-- Performance counter stats for './hellocc':
907,810 dTLB-loads
2,045 dTLB-misses # 0.23% of all dTLB cache hits
4,533 cache-misses
0.006780635 seconds time elapsed
My simple code:
#include <iostream>
using namespace std;
int main
{
cout << "hello" << "\n";
for(int i=1; i <= 100; i = i + 1)
cout<< i << "\t" ;
return 0;
}
One way you could simulate this is using cachegrind, a part of valgrind.
Cachegrind simulates how your program interacts with a machine's cache hierarchy and (optionally) branch predictor. It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines.
While it's not your hardware, which I don't think you can get to, it's a good stand-in.
The cache behaviour of your program depends on what else is happening on your system at the time.
On a Linux system there are many processes running, such as the X server and window manager, the terminal, your editor, various daemon processes, and whatever else you have running (such as a web browser).
Depending on the vagaries of the scheduler, and the demands these other programs place on your system, your program's data may or may not stay in cache (the scheduler may even page your process entirely to the swap file), so the number of cache misses will vary depending on the other applications running.