File I/O with /dev/random takes too long - C++

I want to write a program that generates truly random numbers using /dev/random on Linux, but I later found that its running time is occasionally quite unacceptable. A C version of it runs fast consistently.
#include <iostream>
#include <fstream>
using namespace std;

int main(int argc, char* argv[])
{
    ifstream random("/dev/random", ios_base::in);
    int t;
    random.read(reinterpret_cast<char*>(&t), sizeof(t));  // may block until enough entropy is available
    cout << t << endl;
    random.close();
    return 0;
}
Timing of two runs:
$: time ./random
-1040810404
real 0m0.004s
user 0m0.000s
sys 0m0.000s
$: time ./random
-1298913761
real 0m4.119s
user 0m0.000s
sys 0m0.000s

You have likely drained the entropy pool. Creating (or rather, harvesting) entropy relies on device drivers that sample qualities of the physical world which are mostly unpredictable. But if those devices are not very active, or if the entropy-producing algorithm stalls, your reads from /dev/random will stall too.
Can you use /dev/urandom? If not, you should look into ways that you can produce more entropy in a more deterministic fashion.
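For illustration, here is a minimal sketch of the same program reading from /dev/urandom instead; it does not block when the kernel's entropy estimate runs low (Linux is assumed, and otherwise it mirrors the code above):
#include <fstream>
#include <iostream>

int main()
{
    // /dev/urandom never blocks; it keeps producing output from the kernel CSPRNG.
    std::ifstream urandom("/dev/urandom", std::ios_base::in | std::ios_base::binary);
    int t = 0;
    urandom.read(reinterpret_cast<char*>(&t), sizeof(t));
    std::cout << t << std::endl;
    return 0;
}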
Here are some suggestions from an article regarding a similar problem:
Involve an audio entropy daemon like AED to gather noise from your data center with an open microphone, and maybe combine it with a webcam noise collector like VED. Other sources talk about "Cryptographic Randomness from Air Turbulence in Disk Drives". :)
Use the Entropy Gathering Daemon to collect weaker entropy from randomness of userspace programs.

Related

Measuring FLOPS and memory traffic in a C++ program

I am trying to profile a C++ program. As a first step, I want to determine whether the program is compute-bound or memory-bound using the Roofline Model, so I need to measure the following 4 quantities (they combine as shown in the sketch after this list).
W: # of computations performed in the program (FLOPs)
Q: # of bytes of memory traffic incurred in the program (bytes)
π: peak performance (FLOP/s)
β: peak bandwidth (bytes/s)
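As a quick illustration of how these four quantities combine, here is a minimal sketch of the Roofline bound, attainable performance = min(π, β·I) with arithmetic intensity I = W/Q; every number in it is a placeholder, not a measurement from this question:
#include <algorithm>
#include <cstdio>

int main()
{
    double W    = 83.4e6;   // total work [FLOPs], e.g. from a perf counter
    double Q    = 1.2e9;    // total memory traffic [bytes]
    double pi   = 50e9;     // peak performance [FLOP/s]
    double beta = 20e9;     // peak bandwidth [bytes/s]

    double I = W / Q;                               // arithmetic intensity [FLOP/byte]
    double attainable = std::min(pi, beta * I);     // the roofline itself

    std::printf("I = %.3f FLOP/byte, attainable = %.2f GFLOP/s (%s-bound)\n",
                I, attainable / 1e9, beta * I < pi ? "memory" : "compute");
    return 0;
}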
I have tried to use Linux perf to measure W. I followed the instructions here, using libpfm4 to determine the available events (via ./showevinfo). I found my CPU supports the INST_RETIRED event with umask X87, then I used ./check_events INST_RETIRED:X87 to find the event code, which is 0x5302c0. Then I tried perf stat -e r5302c0 ./test_exe and got
Performance counter stats for './test_exe':
83,381,997 r5302c0
20.134717382 seconds time elapsed
74.691675000 seconds user
0.357003000 seconds sys
Questions:
Is this the right process for measuring W for my program? If yes, then W should be 83,381,997 FLOPs, right?
Why is this FLOP count not stable between repeated executions?
How can I measure the other quantities Q, π and β?
Thanks for your time and any suggestions.

OpenCV based programs optimization on minimal linux embedded systems

I'm building my own embedded Linux OS for the Raspberry Pi 3 using Buildroot. This OS will be used to handle several applications, one of which performs object detection based on OpenCV (v3.3.0).
I started with Raspbian Jessie + Python, but it turned out that it takes a lot of time to execute even a simple example, so I decided to design my own RTOS with optimized features + C++ development instead of Python.
I thought that with these optimizations the 4 cores of the RPI + the 1GB of RAM would handle such applications. The problem is that even with these things, the simplest computer-vision programs take a lot of time.
PC vs. Raspberry PI3 Comparison
This is a simple program I wrote to have an idea of the order of magnitude of execution time of each part of the program.
#include <stdio.h>
#include "opencv2/core.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"
#include <time.h> /* clock_t, clock, CLOCKS_PER_SEC */

using namespace cv;
using namespace std;

int main()
{
    setUseOptimized(true);
    clock_t t_access, t_proc, t_save, t_total;

    // Access time.
    t_access = clock();
    Mat img0 = imread("img0.jpg", IMREAD_COLOR);   // takes ~90ms
    t_access = clock() - t_access;

    // Processing time
    t_proc = clock();
    cvtColor(img0, img0, COLOR_BGR2GRAY);
    blur(img0, img0, Size(9,9));                   // takes ~18ms
    t_proc = clock() - t_proc;

    // Saving time
    t_save = clock();
    imwrite("img1.jpg", img0);
    t_save = clock() - t_save;

    t_total = t_access + t_proc + t_save;
    //printf("CLOCKS_PER_SEC = %ld\n\n", (long)CLOCKS_PER_SEC);
    printf("(TEST 0) Total execution time\t %ld cycles \t= %f ms!\n", (long)t_total, ((float)t_total)*1000./CLOCKS_PER_SEC);
    printf("---->> Accessing in\t %ld cycles \t= %f ms.\n", (long)t_access, ((float)t_access)*1000./CLOCKS_PER_SEC);
    printf("---->> Processing in\t %ld cycles \t= %f ms.\n", (long)t_proc, ((float)t_proc)*1000./CLOCKS_PER_SEC);
    printf("---->> Saving in\t %ld cycles \t= %f ms.\n", (long)t_save, ((float)t_save)*1000./CLOCKS_PER_SEC);
    return 0;
}
[Screenshots omitted: results of execution on an i7 PC and on the Raspberry Pi 3 (OS generated from Buildroot).]
As you can see, there is a huge difference. What I need is to optimize every single detail so that this example's processing step occurs in "near" real time, within a maximum of 15 ms of processing time instead of the 44 ms. So these are my questions:
How can I optimize my OS so that it can handle computation-intensive applications, and how can I control the priorities of each part?
How can I fully use the 4 cores of the RPI3 to fulfill the requirements?
Are there any other possibilities besides OpenCV?
Should I use C instead of C++?
Any hardware improvements you recommend?
Well, as I understand it, you want to get about 30-40 fps. In the case of your i7: it is fast and has a ton of acceleration techniques enabled by default by Intel. In the case of the Raspberry Pi: well, we love it, but it is slow, especially for image-processing programs.
How can I optimize my OS so that it can handle intensive calculations applications and how can control the priorities of each part?
You should include an acceleration library for ARM (e.g. NEON support) and recompile OpenCV with those features enabled.
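For illustration, a minimal sketch for checking at runtime what your OpenCV build actually enables (standard OpenCV 3.x calls; whether NEON shows up depends entirely on how the library was compiled):
#include <iostream>
#include "opencv2/core.hpp"
#include "opencv2/core/utility.hpp"

int main()
{
    // Dump the full build configuration (compiler flags, parallel framework, etc.).
    std::cout << cv::getBuildInformation() << std::endl;
    // Check whether NEON (the ARM SIMD extension) is usable by this build.
    std::cout << "NEON: " << (cv::checkHardwareSupport(CV_CPU_NEON) ? "yes" : "no") << std::endl;
    // Confirm that optimized code paths are switched on.
    std::cout << "Optimized: " << cv::useOptimized() << std::endl;
    return 0;
}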
How can I fully use the 4 Cores of RPI3 to fulfill the requirements?
Parallelize your code so it can run on all 4 cores.
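As a rough sketch of that idea using OpenCV's own parallel framework (cv::parallel_for_ from OpenCV 3.x); the per-pixel operation is an arbitrary example and assumes a single-channel 8-bit image:
#include "opencv2/core.hpp"
#include "opencv2/core/utility.hpp"

// Runs a per-row operation over the image on OpenCV's internal thread pool.
class InvertRows : public cv::ParallelLoopBody
{
public:
    explicit InvertRows(cv::Mat &img) : img_(img) {}
    virtual void operator()(const cv::Range &range) const
    {
        for (int r = range.start; r < range.end; ++r) {
            uchar *row = img_.ptr<uchar>(r);
            for (int c = 0; c < img_.cols; ++c)
                row[c] = 255 - row[c];   // arbitrary per-pixel work (invert)
        }
    }
private:
    cv::Mat &img_;
};

void processInParallel(cv::Mat &img)
{
    cv::setNumThreads(4);                                   // use all four RPi3 cores
    cv::parallel_for_(cv::Range(0, img.rows), InvertRows(img));
}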
Is there any other possibilities instead of OpenCV?
Ask yourself first: what features do you need from OpenCV?
Should I use C instead of C++?
Changing the language will not help you at all; stay with and love C++. It is a beautiful language and very fast.
Any hardware improvements you recommend?
How about another board with a Mali GPU? Then you could run OpenCV code directly on the GPU, which would boost your speed a lot.

How to run clock_gettime correctly in VxWorks to get accurate time

I am trying to measure the time taken by operations in a C++ program on Linux and VxWorks. I have noticed that clock_gettime(CLOCK_REALTIME, timespec) is accurate enough (resolution about 1 ns) to do the job on many OSes. For portability, I am using this function and running it on both VxWorks 6.2 and Linux 3.7.
I've tried to measure the time taken by a simple print:
#include <time.h>      /* clock_gettime, struct timespec */
#include <cstdint>     /* uint32_t */
#include <iostream>
#define BILLION 1000000000L

int main(){
    struct timespec start, end;
    uint32_t diff;
    for(int i = 0; i < 1000; i++){
        clock_gettime(CLOCK_REALTIME, &start);
        std::cout << "Do stuff" << std::endl;
        clock_gettime(CLOCK_REALTIME, &end);
        diff = BILLION*(end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec);
        std::cout << diff << std::endl;
    }
    return 0;
}
I compiled this on Linux and VxWorks. For Linux, the results seemed logical (average 20 µs). But for VxWorks, I got a lot of zeros, then 5000000 ns, then a lot of zeros again...
PS: for VxWorks, I ran this app on an ARM Cortex-A8, and the results seemed random.
Has anyone seen the same bug before?
In VxWorks, the clock resolution is defined by the system scheduler frequency. By default, this is typically 60 Hz, but it may differ depending on the BSP, kernel configuration, or runtime configuration.
The VxWorks kernel configuration parameters SYS_CLK_RATE_MAX and SYS_CLK_RATE_MIN define the maximum and minimum values supported, and SYS_CLK_RATE defines the default rate, applied at boot.
The actual clock rate can be modified at runtime using sysClkRateSet, either within your code, or from the shell.
You can check the current rate by using sysClkRateGet.
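For illustration only, a small sketch of raising the tick rate from code; sysClkRateGet/sysClkRateSet are the documented VxWorks calls, but the 1000 Hz value and the exact headers are assumptions that may vary by VxWorks version and BSP:
#include <vxWorks.h>   /* OK, STATUS */
#include <sysLib.h>    /* sysClkRateGet(), sysClkRateSet() */
#include <stdio.h>

void bumpSysClkRate(void)
{
    printf("current system clock rate: %d ticks/s\n", sysClkRateGet());
    /* Raise the tick rate to 1000 Hz, i.e. 1 ms resolution for tick-based clocks. */
    if (sysClkRateSet(1000) == OK)
        printf("new system clock rate: %d ticks/s\n", sysClkRateGet());
    else
        printf("sysClkRateSet failed (rate outside SYS_CLK_RATE_MIN/MAX?)\n");
}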
Given that you are seeing either 0 or 5000000ns - which is 5ms, I would expect that your system clock rate is ~200Hz.
To get greater resolution, you can increase the system clock rate. However, this may have undesired side effects, as this will increase the frequency of certain system operations.
A better method of timing code may be to use sysTimestamp which is typically driven from a high frequency timer, and can be used to perform high-res timing of short-lived activities.
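A minimal sketch of that approach, assuming the BSP provides the timestamp driver (on some BSPs you must call sysTimestampEnable() first, and the exact headers vary, so treat this as an outline rather than drop-in code):
#include <vxWorks.h>   /* UINT32 */
#include <sysLib.h>    /* sysTimestamp(), sysTimestampFreq() */
#include <iostream>

void timeOnePrint()
{
    UINT32 t0 = sysTimestamp();        /* high-frequency free-running counter */
    std::cout << "Do stuff" << std::endl;
    UINT32 t1 = sysTimestamp();

    /* Convert counter ticks to nanoseconds; this ignores counter wrap-around,
       which is acceptable for activities much shorter than the counter period. */
    double ns = (double)(t1 - t0) * 1e9 / (double)sysTimestampFreq();
    std::cout << ns << " ns" << std::endl;
}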
I think in VxWorks the default clock resolution is 16.66 ms, which you can get by calling the clock_getres() function. You can change the resolution by calling the sysClkRateSet() function (the maximum resolution supported is 200 µs, I guess, by passing 5000 as the argument to sysClkRateSet). You can then calculate the difference between two timestamps using the difftime() function.

Best option to profile CPU use in my program?

I am profiling CPU usage in a simple program I am writing. I have different algorithms I want to try, and I also want to know what the impact on total system performance is.
Currently, I am using ualarm() to execute some instructions at 30 Hz; on every 15th of those interruptions (every 0.5 s) I record the CPU time with getrusage() (in microseconds), so I have an estimate of the total CPU consumption up to that point in time. But to get context, I also need to know the total time elapsed in the system over that period, so I can compute the percentage of it used by my program.
/* Main Loop */
while(1)
{
    alarm = 0;
    /* Waiting Loop: busy-waits until the ualarm() signal handler sets 'alarm' */
    for(i = 0; !alarm; i++){
    }
    count++;
    /* Do my things */
    /* Check if it's time to store cpu log: */
    if ((count % count_max) == 0)
    {
        getrusage(RUSAGE_SELF, &ru);
        store_cpulog(f,
                     (int64_t) ru.ru_utime.tv_sec,
                     (int64_t) ru.ru_utime.tv_usec,
                     (int64_t) ru.ru_stime.tv_sec,
                     (int64_t) ru.ru_stime.tv_usec);
    }
}
I have different options, but I don't know which one will provide the most exact result:
Use ualarm for the timing. Currently it's programmed to signal every 0.5 seconds, so I can take those 0.5 seconds as the elapsed time. It seems quite obvious to use, but is it the best option?
Use clock_gettime(CLOCK_MONOTONIC): it provides readings with nanosecond resolution (see the sketch after this list).
Use gettimeofday(): it provides readings with microsecond resolution. I've found opinions against using it.
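For illustration, a minimal sketch of option 2: pair clock_gettime(CLOCK_MONOTONIC) wall-clock readings with getrusage() CPU times and report the ratio; the function name and the dummy workload are placeholders, not code from the question:
#include <stdio.h>
#include <time.h>          /* clock_gettime, CLOCK_MONOTONIC */
#include <sys/time.h>      /* struct timeval */
#include <sys/resource.h>  /* getrusage */

static double tv_sec(const struct timeval &tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main()
{
    struct timespec w0, w1;
    struct rusage   r0, r1;

    clock_gettime(CLOCK_MONOTONIC, &w0);
    getrusage(RUSAGE_SELF, &r0);

    /* ... the interval being measured (dummy busy work here) ... */
    for (volatile long i = 0; i < 100000000L; ++i) {}

    getrusage(RUSAGE_SELF, &r1);
    clock_gettime(CLOCK_MONOTONIC, &w1);

    double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9;
    double cpu  = (tv_sec(r1.ru_utime) - tv_sec(r0.ru_utime))
                + (tv_sec(r1.ru_stime) - tv_sec(r0.ru_stime));

    printf("wall %.6f s, cpu %.6f s, usage %.1f%%\n",
           wall, cpu, wall > 0.0 ? 100.0 * cpu / wall : 0.0);
    return 0;
}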
Any recommendation? Thanks.
A possible solution is to use the time command and to avoid the busy loop in your program (as @Hasturkun says). Run in a console:
time /path/to/my/program
and after execution of it you get something like:
real 0m1.465s
user 0m0.000s
sys 0m1.210s
I'm not sure whether that precision is enough for you.
Callgrind is possibly the best application for profiling C/C++ code under Linux. Use it with pride :)

How can I profile TLB hits and TLB misses in Ubuntu

I have written a simple C++ program using a for-loop to print the numbers from 1 to 100. I want to find the number of TLB hits and misses occurring for this particular program while it runs. Is there any possibility of getting this data?
I am using Ubuntu, and I have used the perf tool, but it produces different results at different times. I am very confused about what part of my code is leading to such a huge number of TLB hits, TLB misses, and cache misses.
Of course there might be other processes running simultaneously, like the Ubuntu GUI. But do these results include those processes too?
Command I used: perf stat -e dTLB-loads -e dTLB-misses ./hellocc
Result, first time:
Performance counter stats for './hellocc':
909,822 dTLB-loads
2,023 dTLB-misses # 0.22% of all dTLB cache hits
4,512 cache-misses
0.006821182 seconds time elapsed
Result, second time:
Performance counter stats for './hellocc':
907,810 dTLB-loads
2,045 dTLB-misses # 0.23% of all dTLB cache hits
4,533 cache-misses
0.006780635 seconds time elapsed
My simple code:
#include <iostream>
using namespace std;
int main
{
cout << "hello" << "\n";
for(int i=1; i <= 100; i = i + 1)
cout<< i << "\t" ;
return 0;
}
One way you could simulate this is using cachegrind, a part of valgrind.
Cachegrind simulates how your program interacts with a machine's cache hierarchy and (optionally) branch predictor. It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines.
While it's not your hardware, which I don't think you can get to, it's a good stand-in.
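For reference, the usual way to run it (standard Valgrind commands; the binary name is taken from the question, and <pid> stands for whatever process ID Cachegrind appends to its output file):
valgrind --tool=cachegrind ./hellocc
cg_annotate cachegrind.out.<pid>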
The cache behaviour of your program depends on what else is happening on your system at the time.
On a Linux system there are many processes running, such as the X server and window manager, the terminal, your editor, various daemon processes, and whatever else you have running (such as a web browser).
Depending on the vagaries of the scheduler, and the demands these other programs place on your system, your program's data may or may not stay in cache (the scheduler may even page your process entirely to the swap file), so the number of cache misses will vary depending on the other applications running.