OpenCV cv::matchTemplate runs twice as slow on a "better/newer" Intel CPU - C++

I am using cv::matchTemplate to track a moving object in a video.
However, running OpenCV's template matching with a small picture can be slower on a better/newer Intel CPU. The code snippet below typically runs 2 times slower on an i9-7920x (0.28 ms/match) than on an i7-9700k (0.14 ms/match).
#include <chrono>
#include <fstream>
#include <iostream>
#include <thread>
#include <opencv2/opencv.hpp>

#pragma optimize("", off)
int main()
{
    cv::Mat haystack;
    cv::Mat needle;
    cv::Mat result;
    cv::Rect rect;

    //https://en.wikipedia.org/wiki/Barack_Obama#/media/File:President_Barack_Obama.jpg
    haystack = cv::imread("C:/President_Barack_Obama.jpg");

    rect.width = 64;
    rect.height = 64;
    haystack = haystack(rect);

    rect.width = 12;
    rect.height = 12;
    rect.x = 50;
    rect.y = 50;
    needle = haystack(rect);

    auto start = std::chrono::high_resolution_clock::now();
    int nbmatch = 10000;
    for (int i = 0; i < nbmatch; i++) {
        cv::matchTemplate(haystack, needle, result, cv::TemplateMatchModes::TM_CCOEFF_NORMED);
    }
    auto end = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> diff = end - start;
    std::cout << "time per match: " << (diff.count() / nbmatch) * 1000 << " ms\n";
    std::this_thread::sleep_for(std::chrono::seconds(500));
}
In my real application, I noticed this:
i7-9700k: 1ms;
i7-6800k: 1.3ms;
i9-7920x: 2.8ms;
i9-9820x: 2.8ms.
Both i9s are slower by a fair amount, which cannot be explained by the slight difference in clock speed.
Win 7 or 10 does not make a difference. It is compiled with Visual Studio 2019 (v142). OpenCV comes from the pre-built libraries (building it from source myself did not help).
Edit:
The ability of the CPU to scale its frequency seems to have an important impact. When run single-threaded, the i9-7920x still runs at 2.8 ms per match if I sleep regularly, but if I yield instead (CPU load of 100%) it drops to 1.9 ms.
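For illustration, a minimal sketch (hypothetical names, untested) of the two pacing strategies compared above: sleeping lets the core enter a low-power state and clock down between matches, while yielding keeps the core at ~100% load so it holds its boost frequency.
#include <chrono>
#include <thread>

// Hypothetical pacing helpers around the matching call.
// With sleep_for the core may downclock between matches;
// with yield it stays busy and keeps its boost frequency.
void paceWithSleep(std::chrono::microseconds period)
{
    std::this_thread::sleep_for(period); // core can enter a low-power state
}

void paceWithYield(std::chrono::steady_clock::time_point until)
{
    while (std::chrono::steady_clock::now() < until)
        std::this_thread::yield();       // busy-wait, ~100% CPU load
}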
Question:
What could explain this?
Do you think it is possible to bring all processors to compute in the same range of time using cv::matchTemplate?
What else could I do to reduce my computation time?

Related

I followed a CUDA tutorial but my GPU computation time is much longer than my CPU time?

I followed the tutorial on this page but my results are terrible. The time taken is as follows:
CPU: 569
GPU: 11160
Here is my code. What is going wrong? I can't see why this code is so slow.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <chrono>
#include <iostream>
#include <math.h>
#include <stdio.h>
__global__ void addCUDA(int n, float* x, float* y)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride)
y[i] = x[i] + y[i];
}
void add(int n, float* x, float* y)
{
for (int i = 0; i < n; i++)
y[i] = x[i] + y[i];
}
int main()
{
int N = 1 << 20;
float* x = new float[N];
float* y = new float[N];
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
auto t1 = std::chrono::high_resolution_clock::now();
add(N, x, y);
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i] - 3.0f));
std::cout << "Max error: " << maxError << std::endl;
std::cout << duration << std::endl;
delete[] x;
delete[] y;
float* u,
float* v;
cudaMallocManaged(&u, N * sizeof(float));
cudaMallocManaged(&v, N * sizeof(float));
for (int i = 0; i < N; i++) {
u[i] = 1.0f;
v[i] = 2.0f;
}
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(u, N * sizeof(float), device, NULL);
cudaMemPrefetchAsync(v, N * sizeof(float), device, NULL);
auto t3 = std::chrono::high_resolution_clock::now();
addCUDA<<<numBlocks, blockSize>>> (N, u, v);
cudaDeviceSynchronize();
auto t4 = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>(t4 - t3).count();
maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(v[i] - 3.0f));
std::cout << "Max error: " << maxError << std::endl;
std::cout << duration << std::endl;
cudaFree(u);
cudaFree(v);
return 0;
}
For such a trivial operation (a single + on each element), it takes far more time to send the buffers from the host to the GPU and to retrieve the result from the GPU back to the host than to perform the actual computation.
Even if the API is very comfortable to make buffer accesses look easy and almost magic, data has to travel through the PCI-express bus...
The transfer is asynchronous here, but the computation has to wait for it to complete before actually starting; asynchronous transfer is interesting only if you have something else to do in the meantime (organise various stages of a complex computation as a pipeline for example).
If you try with another problem that requires much more computation, the buffer transfers will be amortized.
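As an illustration of that point, here is a hedged sketch (the kernel name and the iteration count are made up) of a kernel with much higher arithmetic intensity than the single add, so that compute time can start to dominate the one-off PCI-express transfers:
// Sketch only: each element receives 'iters' multiply-adds instead of one add,
// so the kernel performs many FLOPs per byte transferred over the bus.
__global__ void heavyAddCUDA(int n, int iters, float* x, float* y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride) {
        float acc = y[i];
        for (int k = 0; k < iters; k++)
            acc = acc * 1.0000001f + x[i];
        y[i] = acc;
    }
}
// Launched like the original kernel, e.g.:
// heavyAddCUDA<<<numBlocks, blockSize>>>(N, 1000, u, v);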
Moreover, two arrays of 1<<20 floats require only 8 MB in total (2 × 2^20 × 4 bytes) and can fit in the cache memory of a modern CPU.
Then, after the initialisation of these two arrays, they may be already hot in cache memory and easily accessible for CPU computation.
Because the computation is a perfectly regular loop, a decent optimizing compiler will use SIMD instructions, the CPU won't mispredict branches and will perfectly prefetch the data in the various cache levels; all of this greatly increases CPU efficiency for this kind of computation.
It's not so easy to outperform a modern CPU with a GPU.
It really depends on the size and the complexity of the problem (and on the specific properties of these two pieces of hardware, of course).
EDIT
As discussed in the comments, the timing method used in the cited article and the one shown in the question are very different.
In the article, nvprof uses internal counters in the GPU to measure the time spent actively computing the addCUDA() (add() in the article) function, without considering either the time it takes to obtain the two source buffers from host and to send back the resulting buffer to host.
Of course, it's fast! On most modern hardware (CPU or GPU), most of the time is spent accessing/transferring data rather than computing. If we measured only the time our CPU spends performing the additions, ignoring the time spent fetching/writing data from/to cache/memory, it would not be very long either!
(Note that the CPU code in the article is not even compiled with optimisation turned on; do such timings have any meaning?)
In the code shown in the question, the timing method is quite different but much more relevant in my opinion.
The two calls to std::chrono::high_resolution_clock::now() actually consider the time spent doing all the work: sending the two source buffers, computing on them and fetching the resulting buffer.
It's the only duration that matters after all!
This way, it is fair to compare this duration to the one we obtain (with a similar method) when timing the CPU.
The fact that cudaMemPrefetchAsync() is used can be misleading because we could think that the transfer of the source buffers is excluded from the timings: it is not, and that's why we find the result disappointing compared to the article.
We launch the timer right after these two calls in order to measure the time spent in the computation, but the computation has to wait for these transfers to complete before actually starting (I would even have started the timer before these two calls).
Moreover, the call to cudaDeviceSynchronize() before stopping the timer waits for the transfer of the resulting buffer to complete in order to actually make the result available to the host.
If we used cudaDeviceSynchronize() before starting the timer, we could have excluded the two initial transfers from the timing, but what's the point of such a timing?
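To make the difference concrete, here is a sketch (based on the code in the question, untested) of where the synchronisation and clock calls would go for each timing:
// Timing everything the host has to wait for (the method used in the question):
auto tAllStart = std::chrono::high_resolution_clock::now(); // could even go before the prefetches
addCUDA<<<numBlocks, blockSize>>>(N, u, v);
cudaDeviceSynchronize(); // result is now usable on the host
auto tAllEnd = std::chrono::high_resolution_clock::now();

// Kernel-mostly timing (closer to what nvprof reports): synchronise first so the
// initial transfers are finished, then time only the launch and execution.
cudaDeviceSynchronize();
auto tKernelStart = std::chrono::high_resolution_clock::now();
addCUDA<<<numBlocks, blockSize>>>(N, u, v);
cudaDeviceSynchronize();
auto tKernelEnd = std::chrono::high_resolution_clock::now();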
In conclusion, I think the timing method you used in your question is much better than the one promoted in the article since you can really compare the benefit you obtain (or not!) from one technology over the other.
For information, on my computers, with full optimisation turned on, your code gives these results:
CPU: 809  Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
GPU: 1160 NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
CPU: 157  Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
GPU: 1158 NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)

Writing OpenCV frames to disk in C++: is mono-thread write speed limited by anything other than disk throughput?

I'm facing what I consider fairly odd behaviour when writing OpenCV frames to disk: I can't write to disk faster than ~20 fps, regardless of whether I do it on my SSD or my HDD. But, and here's the thing: if I use one thread to write the first half of the data and another to write the second half, then I can write at double the speed (~40 fps).
I'm testing using the code below: two std::vectors are filled with 1920x1080 frames from my webcam and then sent to two different threads to be written to disk. If, for example, I write 2 vectors of size 50 to disk, I can do it at an overall speed of ~40 fps. But if I only use one vector of size 100, that drops to half. How can that be? I thought I would be limited by the disk throughput, which is sufficient to write at least 30 fps, but I'm missing something and I don't know what. Is there another limit (apart from CPU) that I'm not taking into account?
#include "opencv2/opencv.hpp"
#include "iostream"
#include "thread"
#include <unistd.h>
#include <chrono>
#include <ctime>
cv::VideoCapture camera(0);
void writeFrames(std::vector<cv::Mat> &frames, std::vector<int> &compression_params, std::string dir)
{
for(size_t i=0; i<frames.size(); i++)
{
cv::imwrite(dir + std::to_string(i) + ".jpg",
frames[i], compression_params);
}
}
int main(int argc, char* argv[])
{
camera.set(cv::CAP_PROP_FRAME_WIDTH, 1920);
camera.set(cv::CAP_PROP_FRAME_HEIGHT, 1080);
camera.set(cv::CAP_PROP_FPS, 30);
std::vector<int> compression_params;
compression_params.push_back(cv::IMWRITE_JPEG_QUALITY);
compression_params.push_back(95); // [0 - 100] (100 better), default 95
size_t vecSizeA = 50;
size_t vecSizeB = 50;
std::vector<cv::Mat> framesA, framesB;
cv::Mat frame;
std::chrono::system_clock::time_point t0 = std::chrono::system_clock::now();
for(unsigned int i=0; i<vecSizeA; i++)
{
camera >> frame;
framesA.push_back(frame);
}
for(unsigned int i=0; i<vecSizeB; i++)
{
camera >> frame;
framesB.push_back(frame);
}
std::chrono::system_clock::time_point t1 = std::chrono::system_clock::now();
std::thread trA(writeFrames, std::ref(framesA), std::ref(compression_params), "/tmp/frames/A/");
std::thread trB(writeFrames, std::ref(framesB), std::ref(compression_params), "/tmp/frames/B/");
trA.join();
trB.join();
std::chrono::system_clock::time_point t2 = std::chrono::system_clock::now();
double tr = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() / 1000.0;
double tw = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() / 1000.0;
std::cout << "Read fps: " << (vecSizeA + vecSizeB) / tr << std::endl;
std::cout << "Write fps: " << (vecSizeA + vecSizeB) / tw << std::endl;
return 0;
}
Edit: just in case it is not very clear, I'm looking for a way to achieve at least 30 fps write speed. Disks can handle that (we wouldn't be able to record video at 30 fps if that weren't the case), so my limitation comes from my code or from something I'm missing.
Two threads are executing the same function at the same time, and that is why it appears faster than one thread: your threads are both started before either is joined, so they run concurrently. If you use them like this instead, you will get the same fps as one thread:
std::thread trA(writeFrames, std::ref(framesA), std::ref(compression_params), "/tmp/frames/A/");
trA.join();
std::thread trB(writeFrames, std::ref(framesB), std::ref(compression_params), "/tmp/frames/B/");
trB.join();
You can also check here for more details.
If, for example, I write 2 vectors of size 50 to disk, I can do it at an overall speed of ~40 fps. But if I only use one vector of size 100, that drops to [~20 fps]. How can it be?
In imwrite you are encoding/compressing the frames as well, so more work is being done than simply writing to the disk. That could potentially explain the speedup from using multiple threads.
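One way to test that hypothesis (a sketch, untested; writeFramesSplit is a hypothetical helper, the frames and parameters are the ones from the question): separate the JPEG encoding from the disk write with cv::imencode and time the two phases independently.
#include <fstream>
#include <string>
#include <vector>
#include "opencv2/opencv.hpp"

// Hypothetical split of writeFrames(): cv::imencode does the CPU-bound JPEG
// compression in memory, the std::ofstream write is the pure disk I/O part.
void writeFramesSplit(const std::vector<cv::Mat>& frames,
                      const std::vector<int>& compression_params,
                      const std::string& dir)
{
    for (size_t i = 0; i < frames.size(); i++)
    {
        std::vector<uchar> buf;
        cv::imencode(".jpg", frames[i], buf, compression_params);           // encode (CPU)
        std::ofstream out(dir + std::to_string(i) + ".jpg", std::ios::binary);
        out.write(reinterpret_cast<const char*>(buf.data()), buf.size());   // write (disk)
    }
}
If the encode phase takes most of the time, the speedup you see comes from parallelising the compression, not from the disk.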

OpenCV - Basic Operations - Performance Issue [in Mode: Release]

I may have discovered a huge performance issue with OpenCV's own implementation of matrix multiplication / summation, and wanted to check with you guys whether I am missing something:
In advance: All runs were done in (OpenCV's) Release Mode.
Setup:
(a) I'll do a matrix-vector multiplication 10 million times with a 3-by-3 matrix and a 3-by-1 vector. The implementation is simply: res = mat * vec;
(b) I'll do the same with my own implementation that accesses the elements individually and does the multiplication using pointer arithmetic [basically just multiplying everything out and writing down the equation for each row of the result vector].
I tested these variants with the compiler flags -O0, -O1, -O2, -O3, -Ofast and for OpenCV 3.1 & 3.2.
The timings are done using chrono (high_resolution_clock) on Ubuntu 16.04.
Findings:
In all cases the non-optimized method (b) outperforms the OpenCV method (a) by a factor of ~100 to ~1000.
Question:
How can that be the case? Shouldn't OpenCV be optimized for these kinds of procedures? Should I raise an issue on Github, or is there something I'm totally missing?
Code: [Ready to copy and test on your machine]
#include <chrono>
#include <iostream>
#include <vector>

#include "opencv2/core/cvstd.hpp"
#include "opencv2/core.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"

int main()
{
    // 1. Setup:
    std::vector<std::chrono::high_resolution_clock::time_point> timestamp_vec_start(2);
    std::vector<std::chrono::high_resolution_clock::time_point> timestamp_vec_end(2);
    std::vector<double> timestamp_vec_total(2);

    cv::Mat test_mat = (cv::Mat_<float>(3,3) <<  0.023, 232.33,  0.545,
                                                22.22,   0.1123, 4.444,
                                                 0.012,  3.4521, 0.202);
    cv::Mat test_vec = (cv::Mat_<float>(3,1) << 5.77,
                                                1.20,
                                                0.03);

    cv::Mat result_1 = cv::Mat(3, 1, CV_32FC1);
    cv::Mat result_2 = cv::Mat(3, 1, CV_32FC1);

    cv::Mat temp_test_mat_results = cv::Mat(3, 3, CV_32FC1);
    cv::Mat temp_test_vec_results = cv::Mat(3, 1, CV_32FC1);

    auto ptr_test_mat_res_0 = temp_test_mat_results.ptr<float>(0);
    auto ptr_test_mat_res_1 = temp_test_mat_results.ptr<float>(1);
    auto ptr_test_mat_res_2 = temp_test_mat_results.ptr<float>(2);

    auto ptr_test_vec_res_0 = temp_test_vec_results.ptr<float>(0);
    auto ptr_test_vec_res_1 = temp_test_vec_results.ptr<float>(1);
    auto ptr_test_vec_res_2 = temp_test_vec_results.ptr<float>(2);

    auto ptr_res_0 = result_2.ptr<float>(0);
    auto ptr_res_1 = result_2.ptr<float>(1);
    auto ptr_res_2 = result_2.ptr<float>(2);

    // 2. OpenCV Basic Matrix Operations:
    timestamp_vec_start[0] = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < 10000000; ++i)
    {
        // factor of up to 5000 here:
        // result_1 = (test_mat + test_mat + test_mat) * (test_vec + test_vec);
        // factor of 30~100 here:
        result_1 = test_mat * test_vec;
    }
    timestamp_vec_end[0] = std::chrono::high_resolution_clock::now();
    timestamp_vec_total[0] = static_cast<double>(std::chrono::duration_cast<std::chrono::microseconds>(timestamp_vec_end[0] - timestamp_vec_start[0]).count());

    // 3. Pixel-Wise Operations:
    timestamp_vec_start[1] = std::chrono::high_resolution_clock::now();
    for(int i = 0; i < 10000000; ++i)
    {
        auto ptr_test_mat_0 = test_mat.ptr<float>(0);
        auto ptr_test_mat_1 = test_mat.ptr<float>(1);
        auto ptr_test_mat_2 = test_mat.ptr<float>(2);

        auto ptr_test_vec_0 = test_vec.ptr<float>(0);
        auto ptr_test_vec_1 = test_vec.ptr<float>(1);
        auto ptr_test_vec_2 = test_vec.ptr<float>(2);

        ptr_test_mat_res_0[0] = ptr_test_mat_0[0] + ptr_test_mat_0[0] + ptr_test_mat_0[0];
        ptr_test_mat_res_0[1] = ptr_test_mat_0[1] + ptr_test_mat_0[1] + ptr_test_mat_0[1];
        ptr_test_mat_res_0[2] = ptr_test_mat_0[2] + ptr_test_mat_0[2] + ptr_test_mat_0[2];
        ptr_test_mat_res_1[0] = ptr_test_mat_1[0] + ptr_test_mat_1[0] + ptr_test_mat_1[0];
        ptr_test_mat_res_1[1] = ptr_test_mat_1[1] + ptr_test_mat_1[1] + ptr_test_mat_1[1];
        ptr_test_mat_res_1[2] = ptr_test_mat_1[2] + ptr_test_mat_1[2] + ptr_test_mat_1[2];
        ptr_test_mat_res_2[0] = ptr_test_mat_2[0] + ptr_test_mat_2[0] + ptr_test_mat_2[0];
        ptr_test_mat_res_2[1] = ptr_test_mat_2[1] + ptr_test_mat_2[1] + ptr_test_mat_2[1];
        ptr_test_mat_res_2[2] = ptr_test_mat_2[2] + ptr_test_mat_2[2] + ptr_test_mat_2[2];

        ptr_test_vec_res_0[0] = ptr_test_vec_0[0] + ptr_test_vec_0[0];
        ptr_test_vec_res_1[0] = ptr_test_vec_1[0] + ptr_test_vec_1[0];
        ptr_test_vec_res_2[0] = ptr_test_vec_2[0] + ptr_test_vec_2[0];

        ptr_res_0[0] = ptr_test_mat_res_0[0]*ptr_test_vec_res_0[0] + ptr_test_mat_res_0[1]*ptr_test_vec_res_1[0] + ptr_test_mat_res_0[2]*ptr_test_vec_res_2[0];
        ptr_res_1[0] = ptr_test_mat_res_1[0]*ptr_test_vec_res_0[0] + ptr_test_mat_res_1[1]*ptr_test_vec_res_1[0] + ptr_test_mat_res_1[2]*ptr_test_vec_res_2[0];
        ptr_res_2[0] = ptr_test_mat_res_2[0]*ptr_test_vec_res_0[0] + ptr_test_mat_res_2[1]*ptr_test_vec_res_1[0] + ptr_test_mat_res_2[2]*ptr_test_vec_res_2[0];
    }
    timestamp_vec_end[1] = std::chrono::high_resolution_clock::now();
    timestamp_vec_total[1] = static_cast<double>(std::chrono::duration_cast<std::chrono::microseconds>(timestamp_vec_end[1] - timestamp_vec_start[1]).count());

    // 4. Printout Timing Results:
    std::cout << "\n\nTimings:\n\n";
    std::cout << "Time spent in OpenCV's implementation: " << timestamp_vec_total[0]/1000.0 << " ms.\n";
    std::cout << "Time spent in element-wise implementation: " << timestamp_vec_total[1]/1000.0 << " ms.\n\n";
    std::cin.get();
    return 0;
}
OpenCV is not optimized for small matrix operations.
You can reduce the overhead a little by not allocating a new matrix for the result inside the loop, for example by using cv::gemm.
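A minimal sketch of that variant (untested, reusing the matrices from the question); cv::gemm computes alpha*src1*src2 + beta*src3 into a preallocated destination:
cv::Mat result_gemm(3, 1, CV_32FC1);   // allocated once, outside the loop
for (int i = 0; i < 10000000; ++i)
{
    // dst = 1.0 * test_mat * test_vec + 0.0 * (src3 left empty via cv::noArray())
    cv::gemm(test_mat, test_vec, 1.0, cv::noArray(), 0.0, result_gemm);
}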
But if small matrix operations are a bottleneck for you I recommend using Eigen.
Using a quick Eigen implementation like:
Eigen::Matrix3d mat;
mat << 0.023, 232.33,  0.545,
      22.22,   0.1123, 4.444,
       0.012,  3.4521, 0.202;

Eigen::Vector3d vec3;
vec3 << 5.77,
        1.20,
        0.03;

Eigen::Vector3d result_e;
for (int i = 0; i < 10000000; ++i)
{
    result_e = (mat * 3) * (vec3 * 2);
}
gives me the following numbers with VS2015 (obviously the difference might be less dramatic in GCC or Clang):
Timings:
Time spent in OpenCV's implementation: 2384.45 ms.
Time spent in element-wise implementation: 78.653 ms.
Time spent in Eigen implementation: 36.088 ms.

Efficiency of summing images using MATLAB and OpenCV

I am totally surprised by all of your answers. Thank you very much!
The buggy code is shown below:
percentage = (double)kk * 100.0 / (double)totalnum;
After I modified it to:
percentage = (double)kk * 100.0 / totalnum;
The problem is SOLVED. This simple division consumed about 90 s out of 150 s. Maybe division between a double and an int is faster than division between two doubles.
Again, thanks for all of your answers!
I'm trying to get the average image from a set of pictures which come from a video. There are only 2 steps in this job:
Sum up all the images into a matrix.
Divide the matrix by the number of images.
I used the following code in OpenCV (C++):
Mat avIM = Mat::zeros(IMG_HEIGHT, IMG_WIDTH, CV_32FC3);
for (ii = startnum; ii <= endnum; ii += interval) {
    string fullname = argv[1];
    sprintf(filename, "\\%d.png", ii);
    fullname.append(filename);
    Mat tempIM = imread(fullname.c_str());
    if (tempIM.empty()) { cout << "Can't open image!\n"; return -1; }
    tempIM.convertTo(tempIM, CV_32FC3);
    avIM += tempIM; //Sum up every image
    ++kk;
}
avIM = avIM * (double)(1.0 / kk); //get average
And the following code in Matlab (2015a):
avIM = zeros(size(imread([im.dir,'\',num2str(startnum),'.png'])));
pointIdx = startnum:interval:endnum;
for j=pointIdx,
IM = imread([im.dir,'\',num2str(j),'.png']);
avIM = avIM + double(IM); %Sum up every image
end
avIM = uint8(round(avIM./size(pointIdx,2))); %get average
But when I ran those two programs on 2,100 images, OpenCV took 150.3 s (Release) and Matlab took 103.1 s. It really confused me that a C++ program runs slower than a Matlab script.
So what's slowing down my OpenCV program? If it's caused by my method of matrix accessing, what should I do to improve the efficiency?
Your code seems good enough, and in my tests I found it running 10 times faster than the Matlab code.
However, here is a slightly optimized version that performs a little faster than yours.
Notes
Please note that I don't have a folder with images named as you, so I used cv::glob in C++ version, and dir in Matlab version to get the names of the images in the folder.
In my folder I have 82 small images, so the running time is obviously smaller than yours, but the relative performance should be reliable.
Execution time
Sum only (values in parentheses: Get filenames + Sum):
Matlab:             0.173543 s  (0.185308 s)
OpenCV @Seven Wang: 0.0145206 s (0.0155748 s)
OpenCV @Miki:       0.0128943 s (0.013333 s)
Considerations
Be sure that you're computing the running time consistently in OpenCV and Matlab.
Code
Matlab code:
tic
folder = 'D:\\SO\\temp\\old_075_6\\';
filenames = dir([folder '*.bmp']);
% Get rows and cols from 1st image
img = imread([folder filenames(1).name]);
S = zeros(size(img));
for ii = 1 : length(filenames)
name = filenames(ii).name;
currentImage = imread([folder name]);
S = S + double(currentImage);
end
S = uint8(round(S / length(filenames)));
toc
C++ code:
#include <opencv2/opencv.hpp>
#include <vector>
#include <iostream>

int main()
{
    double ticLoad = double(cv::getTickCount());

    std::string folder = "D:\\SO\\temp\\old_075_6\\*.bmp";
    std::vector<cv::String> filenames;
    cv::glob(folder, filenames);

    int rows, cols;
    {
        // Just load the first image to get rows and cols
        cv::Mat3b img = cv::imread(filenames[0]);
        rows = img.rows;
        cols = img.cols;
    }

    /*{
        double tic = double(cv::getTickCount());

        cv::Mat3d S(rows, cols, 0.0);
        for (const auto& name : filenames)
        {
            cv::Mat currentImage = cv::imread(name);
            currentImage.convertTo(currentImage, CV_64F);
            S += currentImage;
        }
        S = S * double(1.0 / filenames.size());

        cv::Mat3b avg;
        S.convertTo(avg, CV_8U);

        double toc = double(cv::getTickCount());
        double timeLoad = (toc - ticLoad) / cv::getTickFrequency();
        double time = (toc - tic) / cv::getTickFrequency();

        std::cout << "@Seven Wang: " << time << " s (" << timeLoad << " s)" << std::endl;
    }*/

    {
        double tic = double(cv::getTickCount());

        cv::Mat3d S(rows, cols, 0.0);
        cv::Mat3b currentImage;
        for (const auto& name : filenames)
        {
            currentImage = cv::imread(name);
            cv::add(S, currentImage, S, cv::noArray(), CV_64F);
        }
        S /= filenames.size();

        cv::Mat3b avg;
        S.convertTo(avg, CV_8U);

        double toc = double(cv::getTickCount());
        double timeLoad = (toc - ticLoad) / cv::getTickFrequency();
        double time = (toc - tic) / cv::getTickFrequency();

        std::cout << "@Miki: " << time << " s (" << timeLoad << " s)" << std::endl;
    }

    getchar();
    return 0;
}
One point that drew my attention is the type "CV_32FC3". Are you specifically choosing that 32-bit float matrix, and are you sure Matlab gets the pixel values the same way?
Because you have that extra step
tempIM.convertTo(tempIM, CV_32FC3);
in your C++ code, whereas Matlab operates directly as soon as it retrieves the image, without any conversion, which might be slowing down your C++ code. Furthermore, if Matlab is not getting the image in float values, that might be contributing to the speed difference, as floating-point arithmetic is a harder task for the CPU to handle than integer arithmetic.
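If that conversion step is the suspect, one variant to try (a sketch, untested, reusing the variables from the question) is cv::accumulate, which adds the 8-bit frames into a floating-point sum in a single pass, without the explicit convertTo:
Mat avIM = Mat::zeros(IMG_HEIGHT, IMG_WIDTH, CV_32FC3); // float sum buffer
for (ii = startnum; ii <= endnum; ii += interval) {
    Mat tempIM = imread(fullname.c_str());  // fullname built as in the original loop
    if (tempIM.empty()) { cout << "Can't open image!\n"; return -1; }
    cv::accumulate(tempIM, avIM);           // adds uchar pixels into the float sum
    ++kk;
}
avIM = avIM * (double)(1.0 / kk);           // get average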

How to use cv::parallel_for_ for execution time reduction

I created an image processing algorithm using OpenCV and currently I'm trying to improve the time efficiency of my own, simple function which is similar to LUT, but with interpolation between values (double calibRI::corr(double)).
I optimized the pixel loop according to the OpenCV docs.
The non-parallel function (calib(cv::Mat), an object of the calibRI functor class) takes about 0.15 s. I decided to use cv::parallel_for_ to make it shorter.
First I implemented it as image tiling, according to this document. The time was reduced to 0.12 s (4 threads).
virtual void operator()(const cv::Range& range) const
{
    for(int i = range.start; i < range.end; i++)
    {
        // divide image in 'thr' number of parts and process simultaneously
        cv::Rect roi(0, (img.rows/thr)*i, img.cols, img.rows/thr);
        cv::Mat in = img(roi);
        cv::Mat out = retVal(roi);
        out = calib(in); //loops over all pixels and does out[u,v]=calibRI::corr(in[u,v])
    }
}
I thought that running my function in parallel on subimages/tiles/ROIs was not yet optimal, so I implemented it as below:
template <typename T>
class ParallelPixelLoop : public cv::ParallelLoopBody
{
    typedef boost::function<T(T)> pixelProcessingFuntionPtr;

private:
    cv::Mat& image; //source and result image (to be overwritten)
    bool cont; //if the image is continuous
    size_t rows;
    size_t cols;
    size_t threads;
    std::vector<cv::Range> ranges;
    pixelProcessingFuntionPtr pixelProcessingFunction; //pixel modif. function

public:
    ParallelPixelLoop(cv::Mat& img, pixelProcessingFuntionPtr fun, size_t thr = 4)
        : image(img), cont(image.isContinuous()), rows(img.rows), cols(img.cols), pixelProcessingFunction(fun), threads(thr)
    {
        int groupSize = 1;
        if (cont) {
            cols *= rows;
            rows = 1;
            groupSize = ceil( cols / threads );
        }
        else {
            groupSize = ceil( rows / threads );
        }

        int t = 0;
        for(t=0; t<threads-1; ++t) {
            ranges.push_back( cv::Range( t*groupSize, (t+1)*groupSize ) );
        }
        ranges.push_back( cv::Range( t*groupSize, rows<=1?cols:rows ) ); //last range must be to the end of image (ceil used before)
    }

    virtual void operator()(const cv::Range& range) const
    {
        for(int r = range.start; r < range.end; r++)
        {
            T* Ip = nullptr;
            cv::Range ran = ranges.at(r);

            if(cont) {
                Ip = image.ptr<T>(0);
                for (int j = ran.start; j < ran.end; ++j)
                {
                    Ip[j] = pixelProcessingFunction(Ip[j]);
                }
            }
            else {
                for(int i = ran.start; i < ran.end; ++i)
                {
                    Ip = image.ptr<T>(i);
                    for (int j = 0; j < cols; ++j)
                    {
                        Ip[j] = pixelProcessingFunction(Ip[j]);
                    }
                }
            }
        }
    }
};
Then I ran it on a 1280x1024 64FC1 image, on an i5 processor under Win8, and got times in the range of 0.4 s using the code below:
double t = cv::getTickCount();
ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
cv::parallel_for_(cv::Range(0,4),loop);
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
I have no idea why my implementation is so much slower than iterating over all the pixels in subimages... Is there a bug in my code, or are OpenCV ROIs optimized in some special way?
I do not think there is a time measurement error issue, as described here. I'm using OpenCV time functions.
Is there any other way to reduce the time of this function?
Thanks in advance!
Generally it's really hard to say why using cv::parallel_for_ failed to speed up the whole process. One possibility is that the problem is not related to processing/multithreading, but to time measurement. About 2 months ago I tried to optimize this algorithm and I noticed a strange thing: the first time I use it, it takes x ms, but if I use it a second, third, ... time (of course without restarting the application) it takes about x/2 (or even x/3) ms. I'm not sure what causes this behaviour - most likely (in my opinion) it's caused by branch prediction: when the code is executed the first time, the branch predictor "learns" which paths are usually taken, so the next time it can predict which branch to take (and usually the guess will be correct). You can read more about it here - it's a really good question and it can open your eyes to some quite important things.
So, in your situation I would try a few things:
Measure it many times - 100 or 1000 should be enough (if it takes 0.12-0.4 s it won't take much time) - and see whether the last version of your code is still the slowest one. So just replace your code with this:
double t = cv::getTickCount();
for (unsigned int i=0; i<1000; i++) {
    ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
    cv::parallel_for_(cv::Range(0,4),loop);
}
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
Test it on a bigger image. Maybe in your situation you just "don't need" 4 cores, but on a bigger image 4 cores can make a positive difference.
Use a profiler (for example Very Sleepy) to see which part of your code is critical.