Profiling OpenCV using OProfile - c++

I have this basic OpenCV program:
#include <iostream>
#include "opencv2/opencv.hpp"
int main(){
std::cout<<"Reading Image..."<<std::endl;
cv::Mat img = cv::imread("all_souls_000000.jpg", cv::IMREAD_GRAYSCALE);
if(!img.data)
std::cerr<<"Error reading image"<<std::endl;
return 0;
}
Which creates the executable ReadImage. I want to profile it using OProfile. However, running:
operf ./ReadImage > ReadImage.log
Returns:
Kernel profiling is not possible with current system config.
Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
operf: Profiler started
* * * * WARNING: Profiling rate was throttled back by the kernel * * * *
The number of samples actually recorded is less than expected, but is
probably still statistically valid. Decreasing the sampling rate is the
best option if you want to avoid throttling.
Profiling done.
Why does this happen? What is the best way to profile OpenCV?

I was able to run operf on an OpenCV app with the result below. Is this what you are looking for?
Profiling started at Tue Jan 31 16:52:48 2017
Profiling stopped at Tue Jan 31 16:52:53 2017
-- OProfile/operf Statistics --
Nr. non-backtrace samples: 337018
Nr. kernel samples: 5603
Nr. user space samples: 331415
Nr. samples lost due to sample address not in expected range for domain: 0
Nr. lost kernel samples: 0
Nr. samples lost due to sample file open failure: 0
Nr. samples lost due to no permanent mapping: 0
Nr. user context kernel samples lost due to no app info available: 0
Nr. user samples lost due to no app info available: 0
Nr. backtraces skipped due to no file mapping: 0
Nr. hypervisor samples dropped due to address out-of-range: 0
Nr. samples lost reported by perf_events kernel: 0

Related

Cuda Multi-GPU Latency

I'm new to CUDA and I'm trying to analyse the performance of two GPUs (RTX 3090; 48GB vRAM) in parallel. The issue I face is that for the simple block of code shown below, I would expect the overall block to complete in the same time regardless of the presence of the Device 2 code, as the transfers run asynchronously on different streams.
// aHost, bHost, cHost, dHost are pinned memory. All arrays are of same length.
for(int i = 0; i < 2; i++){
// ---------- Device 1 code -----------
cudaSetDevice(0);
cudaMemcpyAsync(aDest, aHost, N* sizeof(float), cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(bDest, bHost, N* sizeof(float), cudaMemcpyHostToDevice, stream1);
// ---------- Device 2 code -----------
cudaSetDevice(1);
cudaMemcpyAsync(cDest, cHost, N* sizeof(float), cudaMemcpyHostToDevice,stream2);
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
}
But alas, when I do end up running the block, running Device 1 code alone takes 80ms but adding Device 2 code to the above block adds 20ms, thus reaching 100ms as execution time for the block. I tried profiling the above code and observed the following:
Device 1 + Device 2 Concurrently (image)
When I run Device 1 alone though, I get the below profile:
Device 1 alone (image)
I can see that the initial HtoD process of Device 1 is extended in duration when I add Device 2, and I'm not sure why this is happening because, as far as I can tell, these processes are running independently on different GPUs.
I realise that I haven't created any separate CPU threads to handle the separate devices, but I'm not sure if that would help. Could someone please help me understand why this elongation of duration happens when I add the Device 2 code?
EDIT:
Tried profiling the code, and expected the execution durations to be independent of GPU, although I realise MemCpyAsync involves the host as well and perhaps the addition of Device 2 gives rise to more stress on the CPU as it now has to handle additional transfers...?

Strange behaviour of Parallel Boost Graph Library example code

I have set up simple tests with Parallel Boost Graph Library (PBGL), which I have never used before, and observed entirely unexpected behaviour I would like to explain.
My steps were as follows:
Dump test data in METIS format (a kind of social graph with 50 mln vertices and 100 mln edges);
Build modified PBGL example from graph_parallel\example\dijkstra_shortest_paths.cpp
Example was slightly extended to proceed with Eager, Crauser and delta-stepping algorithms.
Note: building of the example required some obscure workaround about the MUTABLE_QUEUE define in crauser_et_al_shortest_paths.hpp (example code is in fact incompatible with the new mutable_queue)
int lookahead = 1;
delta_stepping_shortest_paths(g, start, dummy_property_map(), get(vertex_distance, g), get(edge_weight, g), lookahead);
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)).lookahead(lookahead));
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)));
Run
mpiexec -n 1 mytest.exe mydata.me
mpiexec -n 2 mytest.exe mydata.me
mpiexec -n 4 mytest.exe mydata.me
mpiexec -n 8 mytest.exe mydata.me
The observed behaviour:
-n 1:
mem usage: 35 GB in 1 running process, which utilizes exactly 1 device thread (processor load 12.5%)
delta stepping time: about 1 min 20 s
eager time: about 2 min
crauser time: about 3 min 20 s.
-n 2:
crash in the stage of data load.
-n 4:
mem usage: 40+ GB in roughly equal parts in 4 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged within the margin of measurement error.
-n 8:
mem usage: 44+ GB in roughly equal parts in 8 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged within the margin of measurement error.
So, apart from the inappropriate memory usage and very low overall performance, the only changes I observe when more MPI processes are running are a slightly increased total memory consumption and a linear rise in processor load.
It is nevertheless evident that the initial graph is somehow partitioned between the processes (probably by vertex number ranges).
What is wrong with this test (and, possibly, with my understanding of MPI usage as a whole)?
My environment:
- one Win 10 PC with 64 GB of RAM and 8 cores;
- MS MPI 10.0.12498.5;
- MSVC 2017, toolset 141;
- boost 1.71
N.B. See original example code here.

Ethereum: Why do I keep creating DAG files?

After reading another question on Stack, I understood that DAG stands for Directed Acyclic Graph.
However, I do not understand how it is used. When I typed ethminer -G, I started to see Creating DAG. XX% done DAG 16:37:39.331|ethminer Generating DAG file. Progress: XX %. This is already the third time it has reached 100%, and it keeps restarting the same process after printing:
Creating DAG. 100% done...miner 16:22:32.015|ethminer Got work package:
miner 16:22:32.015|ethminer Header-hash: xxx
miner 16:22:32.015|ethminer Seedhash: xxx
miner 16:22:32.015|ethminer Target: xxx
ℹ 16:22:32.041|gpuminer0 workLoop 1 #xxx… #xxx…
ℹ 16:22:32.041|gpuminer0 Initialising miner...
[OPENCL]:Using platform: NVIDIA CUDA
[OPENCL]:Using device: GeForce 840M(OpenCL 1.2 CUDA)
miner 16:22:32.542|ethminer Mining on PoWhash #xxx… : 0 H/s = 0 hashes / 0.5 s
miner 16:22:32.542|ethminer Grabbing DAG for #xxx…
[OPENCL]:Printing program log
[OPENCL]:
[OPENCL]:Creating one big buffer for the DAG
[OPENCL]:Loading single big chunk kernels
[OPENCL]:Mapping one big chunk.
[OPENCL]:Creating buffer for header.
[OPENCL]:Creating mining buffer 0
[OPENCL]:Creating mining buffer 1
To be precise, I am using Ubuntu 16.04 and CUDA 8.0 with driver 367 for my NVIDIA card.
Ethash, the proof-of-work algorithm used by Ethereum, was designed to be memory-hard. Part of this is the requirement for the entire DAG file to be stored in a GPU's memory.
There is better explanation here: https://ethereum.stackexchange.com/questions/1993/what-actually-is-a-dag/1996
The reason ethminer keeps restarting is that your NVIDIA GeForce 840M has only 2 GB of memory, whereas at the time this question was posted the DAG size on the Ethereum network was ~3 GB.

Meaning of very high Elapsed(wall clock) time and low System time in Linux

I have a C++ binary and I am trying to measure its worst case performance.
I executed it with
/usr/bin/time -v < command >
And result was as
User time (seconds): 161.07
System time (seconds): 16.64
Percent of CPU this job got: 7%
Elapsed (wall clock) time (h:mm:ss or m:ss): 39:44.46
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 19889808
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1272786
Voluntary context switches: 233597
Involuntary context switches: 138
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
How do I interpret this result, what is causing this application to take this much time?
There is no waiting for user input, it basically deals with large text file and database.
I am looking at it from a Linux (OS) perspective. Is it too many context switches (round-robin scheduling in Linux) that caused this?
The best thing you can do is to run it with a profiler like gprof, gperftools, callgrind (part of valgrind) or (the best in my opinion) Intel VTune. They can show you what is going on behind the code. And you'd better have debug symbols (which is not the same as compiling without optimization) to get a clear picture. Otherwise you can only make "best guesses" about what is going on under the hood...
As I said, I'm biased towards VTune as it is fast and it displays a lot of useful info. Take a look here at an example:
Vtune example

How to get total cpu usage in Linux using C++

I am trying to get total CPU usage in %. First, I should say that "top" will simply not do: there is a delay between CPU dumps, and it requires two dumps and several seconds, which hangs my program (I do not want to give it its own thread).
The next thing I tried was "ps", which is instant but always gives a very high total (20+), and when I actually got my CPU to do something it stayed at about 20...
Is there any other way that I could get total cpu usage? It does not matter if it is over one second or longer periods of time... Longer periods would be more useful, though.
cat /proc/stat
http://www.linuxhowtos.org/System/procstat.htm
I agree with this answer above. The cpu line in this file gives the total number of "jiffies" your system has spent doing different types of processing.
What you need to do is take 2 readings of this file, separated by whatever interval of time you require. The numbers are increasing values (subject to integer rollover), so to get the %cpu you need to calculate how many jiffies have elapsed over your interval versus how many jiffies were spent doing work.
e.g.
Suppose at 14:00:00 you have
cpu 4698 591 262 8953 916 449 531
total_jiffies_1 = (sum of all values) = 16400
work_jiffies_1 = (sum of user,nice,system = the first 3 values) = 5551
and at 14:00:05 you have
cpu 4739 591 289 9961 936 449 541
total_jiffies_2 = 17506
work_jiffies_2 = 5619
So the %cpu usage over this period is:
work_over_period = work_jiffies_2 - work_jiffies_1 = 68
total_over_period = total_jiffies_2 - total_jiffies_1 = 1106
%cpu = work_over_period / total_over_period * 100 = 6.1%
Try reading /proc/loadavg. The first three numbers are the number of processes that are running or waiting to run (on Linux this also counts processes in uninterruptible sleep), averaged over the last 1, 5, and 15 minutes, respectively.
http://www.linuxinsight.com/proc_loadavg.html
Read /proc/cpuinfo to find the number of CPUs/cores available to the system.
Call getloadavg() (or alternatively read /proc/loadavg), take the first value, multiply it by 100 (to convert to percent), and divide by the number of CPUs/cores. If the value is greater than 100, truncate it to 100. Done.
Relevant documentation: man getloadavg and man 5 proc
N.B. Load average, as is usual on *NIX systems, can exceed 100% (per CPU/core) because it actually measures the number of processes ready to be run by the scheduler. With a Windows-like CPU metric, when the load is at 100% you do not really know whether the CPU is being used optimally or the system is overloaded. Under *NIX, optimal CPU use would give a loadavg of ~1.0 (or 2.0 for a dual-CPU system). If the value is much greater than the number of CPUs/cores, you might want to plug extra CPUs into the box.
Otherwise, dig into the /proc file system.
cpu-stat is a C++ project that lets you read the Linux CPU counters from /proc/stat.
Get the CPUData.* and CPUSnapshot.* files from cpu-stat's src directory.
Quick implementation to get overall cpu usage:
#include "CPUSnapshot.h"
#include <chrono>
#include <thread>
#include <iostream>
int main()
{
CPUSnapshot previousSnap;
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
CPUSnapshot curSnap;
const float ACTIVE_TIME = curSnap.GetActiveTimeTotal() - previousSnap.GetActiveTimeTotal();
const float IDLE_TIME = curSnap.GetIdleTimeTotal() - previousSnap.GetIdleTimeTotal();
const float TOTAL_TIME = ACTIVE_TIME + IDLE_TIME;
int usage = 100.f * ACTIVE_TIME / TOTAL_TIME;
std::cout << "total cpu usage: " << usage << " %" << std::endl;
}
Compile it:
g++ -std=c++11 -o CPUUsage main.cpp CPUSnapshot.cpp CPUData.cpp
I suggest two files to start with:
/proc/stat and /proc/cpuinfo.
http://www.mjmwired.net/kernel/Documentation/filesystems/proc.txt
Have a look at this C++ lib.
The information is parsed from /proc/stat. It also parses memory usage from /proc/meminfo and Ethernet load from /proc/net/dev.
----------------------------------------------
current CPULoad:5.09119
average CPULoad 10.0671
Max CPULoad 10.0822
Min CPULoad 1.74111
CPU: : Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
----------------------------------------------
network load: wlp0s20f3 : 1.9kBit/s : 920Bit/s : 1.0kBit/s : RX Bytes Startup: 15.8mByte TX Bytes Startup: 833.5mByte
----------------------------------------------
memory load: 28.4% maxmemory: 16133792 Kb used: 4581564 Kb Memload of this Process 170408 KB
----------------------------------------------