Optimizing OpenCV-based programs on a minimal embedded Linux system - C++

I'm building my own embedded Linux OS for the Raspberry Pi 3 using Buildroot. This OS will be used to handle several applications, one of which performs object detection based on OpenCV (v3.3.0).
I started with Raspbian Jessie + Python, but it turned out that even a simple example takes a lot of time to execute, so I decided to build my own OS with optimized features and to develop in C++ instead of Python.
I thought that with these optimizations the 4 cores of the RPi plus the 1 GB of RAM would handle such applications. The problem is that, even with these changes, the simplest computer-vision programs still take a lot of time.
PC vs. Raspberry Pi 3 Comparison
This is a simple program I wrote to get an idea of the order of magnitude of the execution time of each part of the program.
#include <stdio.h>
#include "opencv2/core.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"
#include <time.h> /* clock_t, clock, CLOCKS_PER_SEC */

using namespace cv;
using namespace std;

int main()
{
    setUseOptimized(true);
    clock_t t_access, t_proc, t_save, t_total;

    // Access time.
    t_access = clock();
    Mat img0 = imread("img0.jpg", IMREAD_COLOR); // takes ~90ms
    t_access = clock() - t_access;

    // Processing time
    t_proc = clock();
    cvtColor(img0, img0, COLOR_BGR2GRAY);
    blur(img0, img0, Size(9,9)); // takes ~18ms
    t_proc = clock() - t_proc;

    // Saving time
    t_save = clock();
    imwrite("img1.jpg", img0);
    t_save = clock() - t_save;

    t_total = t_access + t_proc + t_save;
    //printf("CLOCKS_PER_SEC = %ld\n\n", (long)CLOCKS_PER_SEC);
    printf("(TEST 0) Total execution time\t %ld cycles \t= %f ms!\n", (long)t_total, ((float)t_total)*1000./CLOCKS_PER_SEC);
    printf("---->> Accessing in\t %ld cycles \t= %f ms.\n", (long)t_access, ((float)t_access)*1000./CLOCKS_PER_SEC);
    printf("---->> Processing in\t %ld cycles \t= %f ms.\n", (long)t_proc, ((float)t_proc)*1000./CLOCKS_PER_SEC);
    printf("---->> Saving in\t %ld cycles \t= %f ms.\n", (long)t_save, ((float)t_save)*1000./CLOCKS_PER_SEC);
    return 0;
}
Results of Execution on an i7 PC
Results of Execution on Raspberry PI (Generated OS from Buildroot)
As you can see, there is a huge difference. What I need is to optimize every single detail so that the processing step of this example runs in "near" real time, within a maximum of 15 ms instead of the current 44 ms. So these are my questions:
How can I optimize my OS so that it can handle computation-intensive applications, and how can I control the priorities of each part?
How can I fully use the 4 cores of the RPi 3 to fulfill the requirements?
Are there any alternatives to OpenCV?
Should I use C instead of C++?
Are there any hardware improvements you recommend?

As I understand it, you want to get about 30-40 fps. In the case of your i7: it is fast and has a ton of acceleration techniques enabled by default by Intel. In the case of the Raspberry Pi: well, we love it, but it is slow, especially for image-processing programs.
How can I optimize my OS so that it can handle intensive calculations applications and how can control the priorities of each part?
You should include some acceleration library for ARM and recompile OpenCV with those features enabled; the sketch below shows how to check what the build running on the target actually has enabled.
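A minimal sketch (assuming an OpenCV 3.x build) that dumps the library's build configuration and a few runtime checks on the Pi itself:
#include <opencv2/core.hpp>
#include <iostream>

int main()
{
    // Full build configuration: look for NEON/VFPV3 and the parallel framework.
    std::cout << cv::getBuildInformation() << std::endl;
    // Runtime checks.
    std::cout << "NEON available:     " << cv::checkHardwareSupport(CV_CPU_NEON) << std::endl;
    std::cout << "Optimizations on:   " << cv::useOptimized() << std::endl;
    std::cout << "Threads used by cv: " << cv::getNumThreads() << std::endl;
    return 0;
}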
How can I fully use the 4 Cores of RPI3 to fulfill the requirements?
Parallelize your code so it can run on all 4 cores, for example as in the sketch below.
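A sketch of what that can look like with OpenCV's own parallel framework (cv::parallel_for_). The strip-wise blur is only an illustration: blurring strips independently extrapolates pixels at the strip borders, so the result differs slightly from a whole-image blur there.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
using namespace cv;

// Blur one horizontal strip of the image per worker.
class StripBlur : public ParallelLoopBody
{
public:
    StripBlur(const Mat& src, Mat& dst) : src_(src), dst_(dst) {}
    void operator()(const Range& range) const override
    {
        Mat srcStrip = src_.rowRange(range.start, range.end);
        Mat dstStrip = dst_.rowRange(range.start, range.end);
        blur(srcStrip, dstStrip, Size(9, 9)); // strip borders are extrapolated
    }
private:
    const Mat& src_;
    Mat& dst_;
};

int main()
{
    Mat src = imread("img0.jpg", IMREAD_GRAYSCALE);
    CV_Assert(!src.empty());
    Mat dst(src.size(), src.type());
    setNumThreads(4); // one worker per Pi core
    parallel_for_(Range(0, src.rows), StripBlur(src, dst));
    imwrite("img1.jpg", dst);
    return 0;
}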
Is there any other possibilities instead of OpenCV?
Ask yourself first which features you actually need from OpenCV.
Should I use C instead of C++?
Changing the language will not help you at all; stay with C++ and love it. It is a beautiful language and very fast.
Any hardware improvements you recommend?
How about another board with a supported Mali GPU? Then you could run OpenCV code directly on the GPU, which would boost your speed a lot; a sketch of OpenCV's transparent API, which is the usual way to do that, follows below.
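Rough sketch, assuming OpenCV 3.x built with OpenCL support: using cv::UMat instead of cv::Mat lets supported functions run on an OpenCL device such as a Mali GPU.
#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>
using namespace cv;

int main()
{
    ocl::setUseOpenCL(true);
    std::cout << "OpenCL available: " << ocl::haveOpenCL() << std::endl;

    UMat img, gray;
    imread("img0.jpg", IMREAD_COLOR).copyTo(img);  // upload into a UMat
    cvtColor(img, gray, COLOR_BGR2GRAY);           // may run on the GPU
    blur(gray, gray, Size(9, 9));                  // may run on the GPU
    imwrite("img1.jpg", gray);
    return 0;
}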

Related

How to run clock_gettime correctly in VxWorks to get accurate time

I am trying to measure the time taken by processes in a C++ program on Linux and VxWorks. I have noticed that clock_gettime(CLOCK_REALTIME, timespec) is accurate enough (resolution about 1 ns) to do the job on many OSes. For portability reasons I am using this function, running it on both VxWorks 6.2 and Linux 3.7.
I've tried to measure the time taken by a simple print:
#include <time.h>
#include <iostream>
#include <cstdint>
#define BILLION 1000000000L
int main(){
    struct timespec start, end;
    uint32_t diff;
    for(int i = 0; i < 1000; i++){
        clock_gettime(CLOCK_REALTIME, &start);
        std::cout << "Do stuff" << std::endl;
        clock_gettime(CLOCK_REALTIME, &end);
        diff = BILLION*(end.tv_sec-start.tv_sec)+(end.tv_nsec-start.tv_nsec);
        std::cout << diff << std::endl;
    }
    return 0;
}
I compiled this on Linux and VxWorks. On Linux the results seemed logical (about 20 µs on average), but on VxWorks I got a lot of zeros, then 5000000 ns, then a lot of zeros again...
PS: on VxWorks I ran this app on an ARM Cortex-A8, and the results seemed random.
Has anyone seen the same bug before?
In VxWorks, the clock resolution is defined by the system scheduler frequency. By default this is typically 60 Hz, but it may differ depending on the BSP, kernel configuration, or runtime configuration.
The VxWorks kernel configuration parameters SYS_CLK_RATE_MAX and SYS_CLK_RATE_MIN define the maximum and minimum values supported, and SYS_CLK_RATE defines the default rate, applied at boot.
The actual clock rate can be modified at runtime using sysClkRateSet, either within your code, or from the shell.
You can check the current rate by using sysClkRateGet.
Given that you are seeing either 0 or 5000000ns - which is 5ms, I would expect that your system clock rate is ~200Hz.
To get greater resolution, you can increase the system clock rate. However, this may have undesired side effects, as this will increase the frequency of certain system operations.
A better method of timing code may be to use sysTimestamp which is typically driven from a high frequency timer, and can be used to perform high-res timing of short-lived activities.
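A small sketch of how the rate can be inspected and raised at runtime (VxWorks kernel code, assuming sysLib.h provides sysClkRateGet/sysClkRateSet as usual; check SYS_CLK_RATE_MAX in your BSP before raising it):
#include <vxWorks.h>
#include <sysLib.h>
#include <stdio.h>

void showAndRaiseClkRate(void)
{
    int rate = sysClkRateGet();                  // ticks per second
    printf("system clock rate: %d Hz (tick = %d us)\n", rate, 1000000 / rate);

    // Raise to 1000 Hz (1 ms ticks), if the BSP allows it.
    if (sysClkRateSet(1000) == OK)
        printf("new rate: %d Hz\n", sysClkRateGet());
    else
        printf("sysClkRateSet failed\n");
}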
I think that in VxWorks the default clock resolution is 16.66 ms, which you can get by calling the clock_getres() function. You can change the resolution by calling sysClkRateSet() (the maximum resolution supported is 200 µs, I guess, obtained by passing 5000 as the argument to sysClkRateSet()). You can then calculate the difference between two timestamps using the difftime() function.

Single-threaded programme apparently using multiple cores

Question summary: all four cores used when running a single threaded programme. Why?
Details: I have written a non-parallelised programme in Xcode (C++). I was in the process of parallelising it, and wanted to see whether what I was doing was actually resulting in more cores being used. To that end I used Instruments to look at the core usage. To my surprise, while my application is single threaded, all four cores were being utilised.
To test whether it changed the performance, I dialled down the number of cores available to 1 (you can do it in Instruments, preferences) and the speed wasn't reduced at all. So (as I knew) the programme isn't parallelised in any way.
I can't find any information on what it means to use multiple cores to perform single threaded tasks. Am I reading the Instruments output wrong? Or is the single-threaded process being shunted between different cores for some reason (like changing lanes on a road instead of driving in two lanes at once - i.e. actual parallelisation)?
Thanks for any insight anyone can give on this.
EDIT with MWE (apologies for not doing this initially).
The following is C++ code that finds primes under 500,000, compiled in Xcode.
#include <iostream>
#include <ctime>

int main(int argc, const char * argv[]) {
    clock_t start, end;
    double runTime;
    start = clock();
    int i, num = 1, primes = 0;
    int num_max = 500000;
    while (num <= num_max) {
        i = 2;
        while (i <= num) {
            if (num % i == 0)
                break;
            i++;
        }
        if (i == num) {
            primes++;
            std::cout << "Prime: " << num << std::endl;
        }
        num++;
    }
    end = clock();
    runTime = (end - start) / (double) CLOCKS_PER_SEC;
    std::cout << "This machine calculated all " << primes << " under " << num_max << " in " << runTime << " seconds." << std::endl;
    return 0;
}
This runs in 36 s or thereabouts on my machine, as shown by the final output and my phone's stopwatch. When I profile it (using Instruments launched from within Xcode) it gives a run-time of around 28 s. The following image shows the core usage.
instruments showing core usage with all 4 cores (with hyper threading)
Now I reduce number of available cores to 1. Re-running from within the profiler (pressing the record button), it says a run-time of 29s; a picture is shown below.
instruments output with only 1 core available
That would accord with my theory that more cores doesn't improve performance for a single thread programme! Unfortunately, when I actually time the programme with my phone, the above took about 1 minute 30s, so there is a meaningful performance gain from having all cores switched on.
One thing that is really puzzling me, is that, if you leave the number of cores at 1, go back to Xcode and run the program, it again says it takes about 33s, but my phone says it takes 1 minute 50s. So changing the cores is doing something to the internal clock (perhaps).
Hopefully that describes the problem fully. I'm running on a 2015 15 inch MBP, with 2.2GHz i7 quad core processor. Xcode 7.3.1
I want to preface this by saying that your question lacks a lot of the information needed for an accurate diagnosis. Anyway, I'll try to explain the most common reason IMHO, assuming your application doesn't use any third-party component that works in a multi-threaded way.
I think this could be an effect of the scheduler. Let me explain what I mean.
Each core of the processor takes a process in the system and executes it for a "short" amount of time. This is the most common approach in desktop operating systems.
Your process is executed on a single core for this amount of time and then stopped so that other processes can continue. When your process is resumed, it may be executed on another core (always one core at a time, but possibly a different one). So a task manager with poor precision and a low-resolution timebase may register utilisation of all the cores, even though only one is busy at any instant.
To verify whether that is the cause, I suggest you look at the percentage of CPU used while your application is running. For a single-threaded application it should be about 1/#cores, in your case 25%; a sketch of how to check this from inside the program follows below.
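A minimal sketch of that check, assuming a CPU-bound workload: compare process CPU time against wall-clock time. For a purely single-threaded run the ratio stays around one core's worth, however often the scheduler migrates the thread between cores.
#include <chrono>
#include <ctime>
#include <iostream>

int main()
{
    auto wallStart = std::chrono::steady_clock::now();
    std::clock_t cpuStart = std::clock();

    // Stand-in workload; replace with the real single-threaded work.
    volatile unsigned long long sink = 0;
    for (unsigned long long i = 0; i < 300000000ULL; ++i) sink += i;

    double cpu  = double(std::clock() - cpuStart) / CLOCKS_PER_SEC;
    double wall = std::chrono::duration<double>(std::chrono::steady_clock::now() - wallStart).count();
    std::cout << "CPU time:  " << cpu  << " s\n"
              << "Wall time: " << wall << " s\n"
              << "Average cores busy: " << cpu / wall << std::endl;
    return 0;
}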
If it's a release build, your compiler may be vectorising or parallelising your code. Also, libraries you link against, say the standard library for example, may be threaded or vectorised.

windows timing drift of performancecounter C++

I am using the QueryPerformanceCounter() function for measuring time in my application. This function is also used to supply the timeline of my application's life. Recently I noticed a drift of this time with respect to other time functions. Finally, I wrote a small test to check whether the drift is real or not. (I use the VS2013 compiler.)
#include <Windows.h>
#include <tchar.h>
#include <chrono>
#include <thread>
#include <cstdio>

static LARGE_INTEGER s_freq;

using namespace std::chrono;

inline double now_WinPerfCounter()
{
    LARGE_INTEGER tt;
    if (TRUE != ::QueryPerformanceCounter(&tt))
    {
        printf("Error in QueryPerformanceCounter() - Err=%lu\n", GetLastError());
        return -1.;
    }
    return (double)tt.QuadPart / s_freq.QuadPart;
}

inline double now_WinTick64()
{
    return (double)GetTickCount64() / 1000.;
}

inline double now_WinFileTime()
{
    FILETIME ft;
    ::GetSystemTimeAsFileTime(&ft);
    long long * pVal = reinterpret_cast<long long *>(&ft);
    return (double)(*pVal) / 10000000. - 11644473600LL;
}

int _tmain(int argc, _TCHAR* argv[])
{
    if (TRUE != ::QueryPerformanceFrequency(&s_freq))
    {
        printf("Error in QueryPerformanceFrequency() - Err=%lu\n", GetLastError());
        return -1;
    }

    // save all timetags at the beginning
    double t1_0 = now_WinPerfCounter();
    double t2_0 = now_WinTick64();
    double t3_0 = now_WinFileTime();
    steady_clock::time_point t4_0 = steady_clock::now();

    for (int i = 0; ; ++i) // forever
    {
        double t1 = now_WinPerfCounter();
        double t2 = now_WinTick64();
        double t3 = now_WinFileTime();
        steady_clock::time_point t4 = steady_clock::now();

        printf("%03d\t %.3lf %.3lf %.3lf %.3lf \n",
               i,
               t1 - t1_0,
               t2 - t2_0,
               t3 - t3_0,
               duration_cast<nanoseconds>(t4 - t4_0).count() * 1.e-9);

        std::this_thread::sleep_for(std::chrono::seconds(10));
    }
    return 0;
}
The output was confusing:
000 0.000 0.000 0.000 0.000
...
001 10.001 10.000 10.002 10.002
...
015 150.006 150.010 150.010 150.010
...
024 240.009 240.007 240.015 240.015
...
025 250.010 250.007 250.015 250.015
...
026 260.010 260.007 260.016 260.016
...
070 700.027 700.039 700.041 700.041
Why is there a difference? It looks like one second does not have the same duration when measured with the different API functions? Also, over the course of a day the difference is not constant...
This is normal, affordable clocks never have infinite precision. GetSystemTime() and GetTickCount() are periodically re-calibrated from an Internet time server, time.windows.com by default. Which uses an unaffordable atomic clock to keep time. Adjustments to catch up or slow down are gradual in order to avoid giving software a heart-attack. The time they report will be off from the Earth's rotation by at most a few seconds, the periodic re-calibration limits long-term drift error.
But QPF is not calibrated, its resolution is far too high to allow these kind of adjustments to not affect the short interval measurements it is normally used for. It is derived from a frequency source available on the chipset, picked by the system builder. Subject to typical electronic part tolerances, temperature in particular affects accuracy and causes variable drift over longer periods. QPF is therefore only suitable to measure relatively short intervals.
Inevitably both clocks can't be the same and you will see them drift apart. You'll have to pick one or the other to be the "master" clock. If long term accuracy is important then you'll have to pick the calibrated source, if high resolution but not accuracy is important then QPF. If both are important then you need custom hardware, like a GPS clock.
Edit: The main reason for the drift you observe is an inaccuracy of the performance counter frequency. The system provides you with a constant value returned by QueryPerformanceFrequency. This constant frequency is only near the true frequency. It is accompanied by an offset and possible drift. Modern Platforms (Windows > 7, invariant TSC) have little offset and little drift.
Example: A 5 ppm offset would cause the time to move forward by 5 us/s or 432 ms/day faster than expected if you use the performance counter frequency to scale performance counter values to times.
General:
QueryPerformanceCounter and GetSystemTimeAsFileTime use different resources (hardware). Modern platforms derive QueryPerformanceCounter from the CPU's time stamp counter (TSC), while GetSystemTimeAsFileTime uses the PIT, the ACPI PM timer, or HPET hardware. See the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2 for details.
It is unavoidable to deal with drift when the two APIs are using different hardware. However, you may extend your test code to calibrate the drift.
Frequency-generating hardware is typically temperature sensitive; therefore the drift may vary depending on the load. See The Windows Timestamp Project for details.
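A rough sketch of such a calibration (hypothetical; uses GetSystemTimePreciseAsFileTime, so Windows 8+ only, use GetSystemTimeAsFileTime on older systems): measure the performance counter against the calibrated file time over a long interval and compare the effective frequency with the nominal one.
#include <Windows.h>
#include <cstdio>

static double fileTimeSeconds()
{
    FILETIME ft;
    ::GetSystemTimePreciseAsFileTime(&ft);   // calibrated system time, 100 ns units
    ULARGE_INTEGER u;
    u.LowPart = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart / 1e7;                 // -> seconds
}

int main()
{
    LARGE_INTEGER freq, c0, c1;
    ::QueryPerformanceFrequency(&freq);

    ::QueryPerformanceCounter(&c0);
    double t0 = fileTimeSeconds();
    ::Sleep(60 * 1000);                      // the longer the interval, the better the estimate
    ::QueryPerformanceCounter(&c1);
    double t1 = fileTimeSeconds();

    double measured = (c1.QuadPart - c0.QuadPart) / (t1 - t0);
    double ppm = (measured / freq.QuadPart - 1.0) * 1e6;
    printf("nominal  %lld Hz\nmeasured %.0f Hz\noffset   %+.2f ppm\n",
           freq.QuadPart, measured, ppm);
    return 0;
}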
Probably, the different functions measure time in different ways.
QueryPerformanceCounter - provides high-resolution time stamps for measuring time intervals.
GetSystemTimeAsFileTime - retrieves the current system date and time in UTC format.
So it is not quite correct to use these functions and compare the times they return.
It's probably a bad idea to use multiple methods of measurement; the different functions have different resolutions, so you'll effectively be seeing noise from values getting rounded to their precision. And of course a small amount of time passes between setting t1..t4, as the functions themselves take time to execute, so some drift is unavoidable.
You may want to look at GetSystemTimePreciseAsFileTime API. It is both high-precision and can adjust periodically from external time servers.

Getting the current time with millisecond accuracy in Qt

Qt documentation about QTime::currentTime() says :
Note that the accuracy depends on the accuracy of the underlying
operating system; not all systems provide 1-millisecond accuracy.
But is there any way to get this time with millisecond accuracy on Windows 7?
You can use QDateTime class and convert the current time with the appropriate format:
QDateTime::currentDateTime().toString("yyyy/MM/dd hh:mm:ss,zzz")
where 'zzz' corresponds to milliseconds.
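If a numeric value is more convenient than a formatted string, a small sketch (assumes Qt 4.7 or later, where QDateTime::currentMSecsSinceEpoch() is available):
#include <QDateTime>
#include <QDebug>

int main()
{
    // Milliseconds since 1970-01-01T00:00:00 UTC, as a 64-bit integer.
    qint64 ms = QDateTime::currentMSecsSinceEpoch();
    // Same instant as a human-readable string; 'zzz' are the milliseconds.
    QString stamp = QDateTime::currentDateTime().toString("yyyy/MM/dd hh:mm:ss.zzz");
    qDebug() << ms << stamp;
    return 0;
}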
You can use the functionality provided by the time.h header file in C/C++.
#include <stdio.h>
#include <time.h>

int main()
{
    clock_t start, end;
    double cpu_time_used;

    start = clock();
    /* Do the work. */
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC; /* elapsed CPU time between the two clock() calls */
    printf("CPU time used: %f s\n", cpu_time_used);
    return 0;
}
Timer resolution may vary between platforms and readings may not be accurate. If you need high-resolution, accurate timestamps on Windows 7, it provides the QPC API:
https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx
GetSystemTimePreciseAsFileTime is claimed to provide system time with <1us resolution.
But that's only about accurate timestamp. If you need to actually do something with 1 ms latency (ex. handle an event), you need a RTOS, not a desktop clunker.
One common way would be to scale up whatever you are doing and do it 10-100 times in a row; that way you can get a more accurate time reading of whatever you are doing by dividing the result by 10-100 (see the sketch below).
But getting millisecond-precise readings of your time is pretty much useless, because you don't have 100% of the CPU time: if the OS gives another process computing time while you are doing your work, your readings will have much greater variance than just 1 millisecond.
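A sketch of that scaling idea, assuming the work being measured is the body of the inner loop:
#include <chrono>
#include <iostream>

int main()
{
    const int N = 100;                       // repeat count to amortize timer noise
    auto t0 = std::chrono::steady_clock::now();
    volatile double sink = 0.0;
    for (int i = 0; i < N; ++i) {
        // ... the operation being timed ...
        for (int j = 0; j < 100000; ++j) sink += j * 0.5;
    }
    auto t1 = std::chrono::steady_clock::now();
    double perRunMs = std::chrono::duration<double, std::milli>(t1 - t0).count() / N;
    std::cout << "average per run: " << perRunMs << " ms" << std::endl;
    return 0;
}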

Best option to profile CPU use in my program?

I am profiling CPU usage in a simple program I am writing. I have different algorithms I want to try, and I also want to know what the impact on total system performance is.
Currently I am using ualarm() to execute some instructions at 30 Hz; every 15 of those interrupts (every 0.5 s) I record the CPU time with getrusage() (in microseconds), so I have an estimate of the total CPU consumption up to that point in time. But to put it in context, I also need to know the total time elapsed in the system over that period, so I can compute the percentage used by my program.
/* Main loop */
while (1)
{
    alarm = 0;
    /* Waiting loop: busy-wait until the ualarm() signal handler sets 'alarm' */
    for (i = 0; !alarm; i++) {
    }
    count++;
    /* Do my things */

    /* Check if it's time to store the CPU log: */
    if ((count % count_max) == 0)
    {
        getrusage(RUSAGE_SELF, &ru);
        store_cpulog(f,
                     (int64_t) ru.ru_utime.tv_sec,
                     (int64_t) ru.ru_utime.tv_usec,
                     (int64_t) ru.ru_stime.tv_sec,
                     (int64_t) ru.ru_stime.tv_usec);
    }
}
I have different options, but I don't know which one will give the most exact result:
Use ualarm for the timing. Currently it's programmed to signal every 0.5 seconds, so I can take those 0.5 seconds as the elapsed time. It seems the obvious thing to use, but is it the best option?
Use clock_gettime(CLOCK_MONOTONIC): it provides readings with nanosecond resolution (see the sketch after this list).
Use gettimeofday(): provides readings with microsecond resolution. I've found opinions against using it.
Any recommendation? Thanks.
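For reference, a minimal sketch of option 2, combining clock_gettime(CLOCK_MONOTONIC) for the elapsed wall time with getrusage() for the CPU time; the 0.5 s window and the names are only illustrative.
#include <stdio.h>
#include <time.h>
#include <sys/resource.h>

static double wall_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static double cpu_now(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

int main(void)
{
    double w0 = wall_now(), c0 = cpu_now();
    /* ... do roughly 0.5 s worth of work here ... */
    double w1 = wall_now(), c1 = cpu_now();
    printf("CPU share over the window: %.1f%%\n", 100.0 * (c1 - c0) / (w1 - w0));
    return 0;
}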
A possible solution is to use the time utility and avoid using a busy loop in your program (as #Hasturkun says). Call in a console:
time /path/to/my/program
and after execution of it you get something like:
real 0m1.465s
user 0m0.000s
sys 0m1.210s
I'm not sure about the precision, or whether it is enough for you.
Callgrind is possibly the best application for profiling C/C++ code under Linux. Use it with pride :)