Speed of UMat vs Mat in DFT OpenCV TAPI - c++

I found some interesting results when running cv::dft on cv::UMat versus cv::Mat. Essentially, UMat is noticeably slower for images smaller than 4096x4096 and only roughly catches up at that size; below it, cv::Mat consistently wins. Is this just because the DFT is not implemented for the T-API and only the cv::Mat implementation exists? The test I ran looks something like this (I used the Celero project to create the benchmark):
constexpr int num_samples = 2;
constexpr int num_iterations = 10;
constexpr int num_rows = 4096;
constexpr int num_cols = 4096;
cv::UMat a = cv::UMat(num_rows, num_cols, CV_32F);
cv::Mat b = cv::Mat(num_rows, num_cols, CV_32F);
void CreateUMat() { cv::randu(a, 0, 256); }
void CreateMat() { cv::randu(b, 0, 256); }
void DftUMat() {
    CreateUMat();
    cv::dft(a, a);
    cv::idft(a, a, cv::DFT_SCALE | cv::DFT_INVERSE);
}
void DftMat() {
    CreateMat();
    cv::dft(b, b);
    cv::idft(b, b, cv::DFT_SCALE | cv::DFT_INVERSE);
}
BASELINE(UMatBenchmarks, Baseline, num_samples, num_iterations) { DftUMat(); }
BENCHMARK(UMatBenchmarks, NoGPU, num_samples, num_iterations) { DftMat(); }
I got the following results:
cv::UMat iterations/sec = 4.51
cv::Mat iterations/sec = 4.70
For a smaller image, say 1024x1024, I got the following results:
cv::UMat iterations/sec = 63.21
cv::Mat iterations/sec = 85.83
From these results, you can see that there is almost no advantage to using UMat for large images, and none at all for smaller ones. This surprises me because I got significant speedups with cv::matchTemplate when switching to the OpenCV T-API. My guess is that cv::dft has not been implemented with OpenCL, but is this truly the case? Or is the DFT just not a good algorithm to offload to the GPU? Thanks!

Related

CL_INVALID_KERNEL_ARGS when trying to write from image to image using OpenCV

I decided to start learning OpenCL. I spent a lot of time compiling and the like, and finally I have a Qt project with OpenCV embedded and OpenCL working. The information on the internet about the next steps is kind of scarce, though. Using other Stack Overflow posts, I cobbled together this kernel, which should swap image color channels.
This is my kernel:
__kernel void shift(
    read_only image2d_t input,
    float shift_x,
    float shift_y,
    write_only image2d_t output,
    int dst_step, int dst_offset, int dst_rows, int dst_cols)
{
    int2 coord = (get_global_id(1), get_global_id(0));
    uint4 pixel = read_imageui(input, samplerLN, coord);
    // create pixel with swapped channels
    uint4 pixel2;
    pixel2.s0 = pixel.s1;
    pixel2.s1 = pixel.s2;
    pixel2.s2 = pixel.s0;
    write_imageui(output, coord, pixel2);
}
And this is how I try to run it:
//! run gpu operation
cv::ocl::Device(context.device(0));
cv::Mat imageOpenCL = cv::imread("D:\\images\\20200424_162602.jpg", cv::IMREAD_GRAYSCALE);
imageOpenCL.convertTo(imageOpenCL, CV_32F, 1.0 / 255);
cv::UMat umat_src = imageOpenCL.getUMat(cv::ACCESS_READ, cv::USAGE_ALLOCATE_DEVICE_MEMORY);
cv::UMat umat_dst(imageOpenCL.size(), CV_32F, cv::ACCESS_WRITE, cv::USAGE_ALLOCATE_DEVICE_MEMORY);
cv::ocl::ProgramSource program(source);
cv::ocl::Image2D imageCL(umat_src);
cv::ocl::Image2D imageCLOut(umat_dst);
float shift_x = 100.5;
float shift_y = -50.0;
cv::ocl::Kernel kernel("shift", program);
kernel.args(imageCL, shift_x, shift_y, imageCLOut);
size_t globalThreads[3] = { (size_t)imageOpenCL.cols, (size_t)imageOpenCL.rows, 1 };
//size_t localThreads[3] = { 16, 16, 1 };
bool success = kernel.run(3, globalThreads, NULL, true);
if (!success) {
    std::cout << "Failed running the kernel..." << std::endl;
    return;
}
// Download the dst data from the device (?)
cv::Mat mat_dst = umat_dst.getMat(cv::ACCESS_READ);
cv::imshow("src", imageOpenCL);
cv::imshow("dst", mat_dst);
I'm probably copying the data wrong, but I'm not sure what to do. I also tried different types instead of CV_32F for the image, such as CV_8U and CV_8UC3.
Your kernel has 8 arguments, but you are only setting 4 of them, hence the CL_INVALID_KERNEL_ARGS error. It does not appear that you use the last 4 arguments in the kernel, so the fix seems to be to remove them from the kernel argument list.

OpenCV Mat Problem: Difference between Histogram and this loop

I am working on an image processing project that I want to implement in CUDA with OpenCV (OpenCV 4.0 with CUDA support), and I am not good at C++.
For color correction between two images, I am using the code from this link: (https://answers.opencv.org/question/178127/matching-colors-between-two-pictures-in-opencv/)
My goal is to run this code on the GPU, so I tried to rewrite it. I have two questions:
1- Is there a CUDA-implemented library for this purpose (with the same functionality)?
2- In rewriting the function do1ChnHist, it seems that this loop calculates a 1D histogram (is that true?):
for (size_t p = 0; p<img.total(); p++)
{
    if (mask(p) > 0)
    {
        uchar c = img(p);
        h(c) += 1.0;
    }
}
But I can't replace it with:
int histSize = 256;
float range[] = { 0, 256 }; //the upper boundary is exclusive
const float* histRange = { range };
bool uniform = false, accumulate = false;
calcHist(&img, 1, 0, Mat(), h, 1, &histSize, &histRange, uniform, accumulate);
or rewrite it with this loop (to allow changing Mat to GpuMat in the future; unfortunately the OpenCV CUDA module does not support GpuMat_<>, so I tried to rewrite the loop with a plain Mat):
Mat h;
h = Mat::zeros(cv::Size(256, 1), CV_16U);
uchar x;
for (size_t m = 0; m < img.size().width; m++)
{
    for (size_t n = 0; n < img.size().width; n++)
    {
        x = img.at<int>(Point(m, n));
        h.at<int>(Point(int(x),0)) += 1;
    }
}
because either of the two options returns a different answer from the original loop in the do1ChnHist function.
Thanks.
OpenCV has all the functions you want:
virtual void cv::cuda::TemplateMatching::match(InputArray image, InputArray templ, OutputArray result, Stream &stream = Stream::Null())
void cv::cuda::calcHist(InputArray src, OutputArray hist, Stream &stream = Stream::Null())
Calculates histogram for one channel 8-bit image.
void cv::cuda::calcHist(InputArray src, InputArray mask, OutputArray hist, Stream &stream = Stream::Null())
Calculates histogram for one channel 8-bit image confined in given mask.
As for the loop: it depends. The result could be a 1D array or a 2D array, depending on the color format. You should learn some basic image processing principles first.
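For comparison, here is a minimal sketch of what the loop in do1ChnHist appears to compute, written in plain C++ over flat buffers rather than OpenCV types. The function name maskedHistogram and the use of std::vector are assumptions made for illustration; they are not part of the linked code. Note that the calcHist call in the question passes an empty Mat() as the mask, so it counts every pixel, which is one way its result can differ from the masked loop.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: 256-bin histogram of an 8-bit single-channel image,
// counting only the pixels whose mask value is greater than zero.
std::array<double, 256> maskedHistogram(const std::vector<std::uint8_t>& img,
                                        const std::vector<std::uint8_t>& mask) {
    if (img.size() != mask.size()) {
        throw std::invalid_argument("img and mask must have the same size");
    }
    std::array<double, 256> h{};  // value-initialised: every bin starts at 0
    for (std::size_t p = 0; p < img.size(); ++p) {
        if (mask[p] > 0) {
            h[img[p]] += 1.0;  // the pixel value is the bin index
        }
    }
    return h;
}
```

Each pixel is read as an 8-bit value and visited exactly once, which is worth checking against the rewritten loop in the question (it reads with at<int> and uses the width as the bound for both loops).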

Why do these histogram functions differ, and why is one nondeterministic?

NOTE: This is a homework problem and the professor explicitly forbids soliciting answers from StackOverflow, so please limit your response to the specific question I have asked and do not attempt to provide a working solution.
I am asked to implement a function that computes the histogram of a single-channel 8-bit image represented as an OpenCV Mat with type CV_8U.
In this case, the histogram uses 256 uniformly-distributed buckets. This is the reference we are intended to replicate (using OpenCV 3.4):
Mat reference;
/// Establish the number of bins
int histSize = 256;
/// Set the ranges ( for B,G,R) )
float range[] = { 0, 256 } ;
const float* histRange = { range };
bool uniform = true;
bool accumulate = false;
cv::calcHist(&bgr_planes[0], 1, 0, Mat(), reference, 1, &histSize, &histRange,
uniform, accumulate);
// reference now contains the canonical histogram of the input image's
// blue channel
I wrote the following function to calculate the histogram, which produces the correct results 45-69% of the time (p<0.05, n=66). Once when it failed, I examined the results and found no discernable pattern. All trials were conducted on the same test image.
Mat myCalcHist(const Mat& input) {
    assert(input.isContinuous());
    Mat res(256, 1, CV_32F);
    for (const uint8_t* it = input.datastart; it != input.dataend; ++it) {
        ++res.at<float>(*it);
    }
    return res;
}
The following function, on the other hand, more closely matches OpenCV's internal implementation in that it uses the idiomatic accessors and converts the float result from an int work matrix, but in n=66 trials it did not produce the correct result a single time. Again, I found no discernable pattern in the data.
Mat myCalcHist(const Mat& input) {
    Mat ires(256, 1, CV_32S);
    for (int i = 0; i < input.total(); ++i) {
        ++ires.at<int>(input.at<uint8_t>(i));
    }
    Mat res(256, 1, CV_32F);
    ires.convertTo(res, CV_32F);
    return res;
}
Why are the results for my first implementation different than those from my second implementation, and where is nondeterminism introduced to the first implementation?
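Initializing the histogram matrix should work. The Mat(rows, cols, type) constructor allocates the buffer but does not zero it, so in both of your versions the counters start from whatever happened to be in that memory. cv::Mat::zeros gives them a defined starting value:
Mat myCalcHist(const Mat& input)
{
    Mat ires = cv::Mat::zeros(256, 1, CV_32S);
    for (int i = 0; i < input.total(); ++i)
    {
        ++ires.at<int>(input.at<uint8_t>(i));
    }
    Mat res(256, 1, CV_32F);
    ires.convertTo(res, CV_32F);
    return res;
}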
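The same point can be illustrated without OpenCV. The following is only a sketch in plain C++, with a hypothetical function name histogram8 and a std::vector standing in for the image data; it shows the counters being explicitly zeroed before counting, which is the step the answer above adds.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical helper: counts how often each 8-bit value occurs in `pixels`.
std::array<std::uint32_t, 256> histogram8(const std::vector<std::uint8_t>& pixels) {
    // The trailing {} value-initialises every bucket to zero. Without it the
    // counters would start from indeterminate values and results could vary
    // from run to run.
    std::array<std::uint32_t, 256> bins{};
    for (std::uint8_t v : pixels) {
        ++bins[v];
    }
    return bins;
}
```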

Faster method of accessing a channel from RGB image in OpenCV?

In my trials with images of 1409x900 and 960x696, it takes 2.5 ms on average to split the channels of an RGB image using OpenCV on my 64-bit 6-core 3.2 GHz Windows machine.
vector<cv::Mat> channels;
cv::split(img, channels);
I found that this takes almost as long as the rest of the image processing (a boolean operation plus a morphological opening).
Since my code only uses one of the channels produced by the split, I wonder if there is a faster way of extracting a single channel from an RGB image, preferably with OpenCV.
UPDATE
As @DanMašek pointed out, there is another function, mixChannels, that can extract a single-channel image from a multi-channel one. I tested it on about 2000 images of the same sizes, and mixChannels took about 1 ms on average. For now, I am satisfied with the result, but post your answer if you can make it faster.
cv::Mat channel(img.rows, img.cols, CV_8UC1);
int from_to[] = { sel_channel,0 };
mixChannels(&img, 1, &channel, 1, from_to, 1);
Two simple options come to mind here.
You mention that you perform this operation repeatedly on images captured from a camera. Therefore it is safe to assume that the images are always the same size.
Allocations of cv::Mat have a non-negligible overhead, so in this case it would be beneficial to reuse the channel Mats. (i.e. allocate the destination images when you receive the first frame, and then just overwrite the contents for subsequent frames)
The additional benefit of this approach is (quite likely) reducing memory fragmentation. This can become a real problem for 32bit code.
You mention that you're interested in only one specific channel (which the user may select arbitrarily). That means you could use cv::mixChannels, which gives you the flexibility in selecting what channels and how you want to extract them.
That means you can extract data for only a single channel, theoretically (depending on the implementation -- study the source code for more details) avoiding the overhead in extracting and/or copying the data for the channels you're not interested in.
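To make the two ideas concrete outside of OpenCV, here is a rough sketch in plain C++ that copies one channel out of an interleaved pixel buffer into a caller-owned output buffer. The function name extractChannel and the flat std::vector layout are assumptions for illustration only; this is not how cv::mixChannels is implemented.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copies channel `channel` of an interleaved image
// (e.g. B,G,R,B,G,R,...) into `out`. Because `out` is owned by the caller,
// repeated calls with same-sized frames reuse its existing allocation.
void extractChannel(const std::vector<std::uint8_t>& interleaved,
                    std::size_t numChannels, std::size_t channel,
                    std::vector<std::uint8_t>& out) {
    if (numChannels == 0 || channel >= numChannels) {
        throw std::invalid_argument("invalid channel selection");
    }
    const std::size_t pixels = interleaved.size() / numChannels;
    out.resize(pixels);  // no reallocation when the size is unchanged
    for (std::size_t i = 0; i < pixels; ++i) {
        // Only the selected channel is read; the others are skipped.
        out[i] = interleaved[i * numChannels + channel];
    }
}
```

Keeping the output buffer alive between frames is the same design choice as the static Mat used in the reuse variants of the test program.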
Let's make a test program evaluating the 4 possible combinations of the approaches outlined above.
Variant 0: cv::split without reuse
Variant 1: cv::split with reuse
Variant 2: cv::mixChannels without reuse
Variant 3: cv::mixChannels with reuse
NB: I just use static for simplicity here; usually I'd make this a member variable in a class that wraps the algorithm.
#include <opencv2/opencv.hpp>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
#define SELECTED_CHANNEL 1
// Variant 0: cv::split into a freshly allocated vector on every call
cv::Mat variant_0(cv::Mat const& img)
{
    std::vector<cv::Mat> channels;
    cv::split(img, channels);
    return channels[SELECTED_CHANNEL];
}
// Variant 1: cv::split into a static vector, so the channel Mats are reused
cv::Mat variant_1(cv::Mat const& img)
{
    static std::vector<cv::Mat> channels;
    cv::split(img, channels);
    return channels[SELECTED_CHANNEL];
}
// Variant 2: cv::mixChannels into a freshly allocated Mat on every call
cv::Mat variant_2(cv::Mat const& img)
{
    // NB: output Mat must be preallocated
    cv::Mat channel(img.rows, img.cols, CV_8UC1);
    int from_to[] = { SELECTED_CHANNEL, 0 };
    cv::mixChannels(&img, 1, &channel, 1, from_to, 1);
    return channel;
}
// Variant 3: cv::mixChannels into a static Mat that is allocated once
cv::Mat variant_3(cv::Mat const& img)
{
    // NB: output Mat must be preallocated
    static cv::Mat channel(img.rows, img.cols, CV_8UC1);
    int from_to[] = { SELECTED_CHANNEL, 0 };
    cv::mixChannels(&img, 1, &channel, 1, from_to, 1);
    return channel;
}
// Runs f on a random 1024x1024 3-channel image STEPS times and prints timings
template<typename T>
void timeit(std::string const& title, T f)
{
    using std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::microseconds;
    cv::Mat img(1024,1024, CV_8UC3);
    cv::randu(img, 0, 256);
    uint32_t const STEPS(1024);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    for (uint32_t i(0); i < STEPS; ++i) {
        cv::Mat result = f(img);
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(t2 - t1).count();
    double t_ms(static_cast<double>(duration) / 1000.0);
    std::cout << title << "\n"
        << "Total = " << t_ms << " ms\n"
        << "Iteration = " << (t_ms / STEPS) << " ms\n"
        << "FPS = " << (STEPS / t_ms * 1000.0) << "\n"
        << "\n";
}
int main()
{
    // Two passes, so the second one is free of warmup costs
    for (uint8_t i(0); i < 2; ++i) {
        timeit("Variant 0", variant_0);
        timeit("Variant 1", variant_1);
        timeit("Variant 2", variant_2);
        timeit("Variant 3", variant_3);
        std::cout << "--------------------------\n\n";
    }
    return 0;
}
Output for the second pass (so we avoid any warmup costs).
Note: Running this on i7-4930K, using OpenCV 3.1.0 (64-bit, MSVC12.0), Windows 10 -- YMMV, especially with CPUs that have AVX2
Variant 0
Total = 1518.69 ms
Iteration = 1.48309 ms
FPS = 674.267
Variant 1
Total = 359.048 ms
Iteration = 0.350633 ms
FPS = 2851.99
Variant 2
Total = 820.223 ms
Iteration = 0.800999 ms
FPS = 1248.44
Variant 3
Total = 427.089 ms
Iteration = 0.417079 ms
FPS = 2397.63
Interestingly, cv::split with reuse wins here. Feel free to edit the answer and add timings from different platforms/CPU generations (especially if the proportions differ radically).
It also seems that with my setup, none of this is parallelized particularly well, so that may be another way to speed this up (something like cv::parallel_for_).

opencv calcHist results are not what expected

In OpenCV, I have a matrix of integers (a 4000x1 Mat). Each time I read a different range of this matrix: Mat labelsForHist = labels(Range(from,to),Range(0,1));
The size of the ranges is variable. Then I convert the labelsForHist matrix to float (because calcHist doesn't accept int values!) by using:
labelsForHist.convertTo(labelsForHistFloat, CV_32F);
After this I call calcHist with these parameters:
Mat hist;
int histSize = 4000;
float range[] = { 0, 4000 } ;
int channels[] = {0};
const float* histRange = { range };
bool uniform = true; bool accumulate = false;
calcHist(&labelsForHistFloat,1,channels,Mat(),hist,1,&histSize,&histRange,uniform,accumulate);
The results are normalized by using:
normalize(hist,hist,1,0,NORM_L1,-1,Mat());
The problem is that my histograms don't look like what I was expecting. Any idea what I am doing wrong, or does the problem come from another part of the code (and not from the calculation of the histograms)?
I expect this sparse histogram:
while I get this flat histogram, for the same data:
The first histogram was calculated in Python, but I want to do the same in C++.
There is a clustering process before the histograms are calculated, so if there is no problem with how the histograms are created, then the problem definitely comes from the clustering part before that!