Dividing cv::Mat by a number using integer division - c++

In OpenCV, if a cv::Mat (CV_8U) is divided by a number (an int), the result is rounded to the nearest integer. For example:
cv::Mat temp(1, 1, CV_8UC1, cv::Scalar(5));
temp /= 3;
std::cout <<"OpenCV Integer Division:" << temp;
std::cout << "\nNormal Integer Division:" << 5 / 3;
The result is:
OpenCV Integer Division: 2
Normal Integer Division: 1
It is obvious that OpenCV does not use integer division even if the type of the cv::Mat is CV_8U.
My questions are:
Why? Aren't integers supposed to be divided as integers? Why this strange behaviour from OpenCV?
Can I obtain integer division without iterating pixel by pixel and dividing each one?
My current solution is:
for (int r = 0; r < temp.rows; r++) {
    auto row_ptr = temp.ptr<uchar>(r);
    for (int c = 0; c < temp.cols; c++) {
        row_ptr[c] /= 3;
    }
}

Firstly: the overloaded division operator performs the operation by converting the matrix elements to double; internally it uses the multiplication operator, as Mat / a = Mat * (1/a).
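In other words, the division is carried out in floating point and the result is rounded, not truncated, when converted back to uchar. A minimal sketch of that conversion step (assuming the rounding behaviour of cv::saturate_cast, which OpenCV uses when narrowing arithmetic results back to 8 bits):

#include <opencv2/core.hpp>
#include <iostream>

int main() {
    double d = 5 * (1.0 / 3.0); // 1.666..., the intermediate floating-point result
    // saturate_cast rounds to the nearest integer (and clamps to [0, 255]):
    std::cout << (int)cv::saturate_cast<uchar>(d) << "\n"; // prints 2, not 1
}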
Secondly: a very easy way exists to do this with one small for loop:
for (int i = 0; i < temp.total(); i++)
    ((unsigned char*)temp.data)[i] /= 3;

The solution I used (based on @Afshine's answer and @Miki's comment) is:
if (frame.isContinuous()) {
    // total() counts pixels, so scale by channels() to cover every byte
    for (size_t i = 0; i < frame.total() * frame.channels(); i++) {
        frame.data[i] /= 3;
    }
}
else {
    for (int r = 0; r < frame.rows; r++) {
        auto row_ptr = frame.ptr<uchar>(r);
        for (int c = 0; c < frame.channels() * frame.cols; c++) {
            row_ptr[c] /= 3;
        }
    }
}
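If OpenCV 3.0 or newer is available, the same per-element integer division can also be written without an explicit loop using cv::Mat::forEach, which may run in parallel. A sketch, assuming a 3-channel CV_8UC3 frame as above:

// cv::Mat::forEach (OpenCV 3.0+) applies the lambda to every pixel,
// potentially in parallel; each channel gets true integer division.
frame.forEach<cv::Vec3b>([](cv::Vec3b& pixel, const int* /*position*/) {
    pixel[0] /= 3;
    pixel[1] /= 3;
    pixel[2] /= 3;
});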


What is wrong with my 2D Array Gaussian Blur function in C++?

I am making a simple Gaussian blur function for a 2D array that is supposed to represent an image. The function just prints out the array values at the end (no actual image processing going on here). I was pretty sure that I had implemented everything correctly, but the values I am getting for (N=3, sigma=1.5) are much lower than expected based on this calculator: http://dev.theomader.com/gaussian-kernel-calculator/
I am following this equation: $G(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/(2\sigma^2)}$
void gaussian_filter(int N, double sigma) {
    double k[N][N];
    for (int i = 0; i < N; i++) { // Initialize kernel to 0
        for (int j = 0; j < N; j++) {
            k[i][j] = 0;
        }
    }
    double sum = 0.0; // There is an issue somewhere in this block of code
    int change = (N / 2);
    double r, s = change * sigma * sigma;
    for (int x = -change; x <= change; x++) {
        for (int y = -change; y <= change; y++) {
            r = sqrt(x*x + y*y);
            k[x + change][y + change] = (exp(-(r*r)/s)) / (M_PI * s);
            sum += k[x + change][y + change];
        }
    }
    for (int i = 0; i < N; ++i) { // Normalize
        for (int j = 0; j < N; ++j) {
            k[i][j] /= sum;
        }
    }
    for (int i = 0; i < N; ++i) { // Print out array
        for (int j = 0; j < N; ++j) {
            cout << k[i][j] << "\t";
        }
        cout << endl;
    }
}
Here is the expected output for N=3 and Sigma=1.5
Here is the current broken output for N=3 and Sigma=1.5
Why does s depend on change? The denominator in the Gaussian's exponent is 2σ², independent of the kernel size. I think you should do:
double r, s = 2 * sigma * sigma;
// instead of
// double r, s = change * sigma * sigma;
That website computes Gaussian kernels in an unorthodox manner:
The weights are calculated by numerical integration of the continuous gaussian distribution over each discrete kernel tap.
That is, it samples a continuous Gaussian kernel that has been convolved with a uniform (“box”) filter 1 pixel wide. The resulting Gaussian is wider than advertised. I advise against this method.
The proper way to create a Gaussian kernel is to just sample the Gaussian function at given integer locations, for example x = [-3, -2, -1, 0, 1, 2, 3].
Do note that a 3-pixel kernel is not wide enough to represent a Gaussian. It is important to sample the tail of the curve; without it, the kernel doesn’t have the good properties of the Gaussian kernel. I recommend sampling up to 3 sigma to each side, leading to 2*ceil(3*sigma)+1 pixels. 2 sigma is the bare minimum, useful only when speed is more important than good results.
Do also note that the Gaussian is separable, you can apply two 1D kernels in succession, rather than a single 2D kernel. For the 9x9 kernel you get for sigma=1.5, this translates to 9+9=18 multiplications and additions, compared to 9x9=81 for the 2D kernel. This is a significant saving!
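To make those two recommendations concrete, here is a minimal sketch (my own helper, not code from the question) that samples a normalized 1D Gaussian kernel at integer locations out to 3 sigma on each side; thanks to separability, it can then be applied along rows and columns in succession:

#include <cmath>
#include <vector>

// Sample exp(-x^2 / (2*sigma^2)) at integer x in [-radius, radius] with
// radius = ceil(3*sigma), then normalize so the weights sum to 1.
std::vector<double> gaussian_kernel_1d(double sigma) {
    int radius = (int)std::ceil(3 * sigma);
    std::vector<double> k(2 * radius + 1);
    double sum = 0.0;
    for (int x = -radius; x <= radius; ++x) {
        k[x + radius] = std::exp(-(x * x) / (2 * sigma * sigma));
        sum += k[x + radius];
    }
    for (double& w : k) w /= sum; // normalization absorbs the 1/(2*pi*sigma^2) factor
    return k;
}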

I would like to improve the performance of this code using AVX

I profiled my code and the most expensive part of the code is the loop included in the post. I want to improve the performance of this loop using AVX. I have tried manually unrolling the loop and, while that does improve performance, the improvements are not satisfactory.
int N = 100000000;
int8_t* data = new int8_t[N];
for (int i = 0; i < N; i++) { data[i] = 1; }
std::array<float, 10> f = {1,2,3,4,5,6,7,8,9,10};
std::vector<float> output(N, 0);
int k = 0;
for (int i = k; i < N; i = i + 2) {
    for (int j = 0; j < 10; j++, k = j + 1) {
        output[i] += f[j] * data[i - k];
        output[i + 1] += f[j] * data[i - k + 1];
    }
}
Could I have some guidance on how to approach this.
I would assume that data is a large input array of signed bytes, and f is a small array of floats of length 10, and output is the large output array of floats. Your code goes out of bounds for the first 10 iterations by i, so I will start i from 10 instead. Here is a clean version of the original code:
int s = 10;
for (int i = s; i < N; i += 2) {
    for (int j = 0; j < 10; j++) {
        output[i] += f[j] * data[i-j-1];
        output[i+1] += f[j] * data[i-j];
    }
}
As it turns out, processing two iterations by i does not change anything, so we simplify it further to:
for (int i = s; i < N; i++)
    for (int j = 0; j < 10; j++)
        output[i] += f[j] * data[i-j-1];
This version of code (along with declarations of input/output data) should have been present in the question itself, without others having to clean/simplify the mess.
Now it is obvious that this code applies a one-dimensional convolution filter, which is a very common operation in signal processing. For instance, it can be computed in Python using the numpy.convolve function. The kernel has a very small length of 10, so a Fast Fourier Transform won't provide any benefits compared to the brute-force approach. Given that the problem is well known, you can read a lot of articles on vectorizing small-kernel convolution. I will follow the article by hgomersall.
First, let's get rid of reverse indexing. Obviously, we can reverse the kernel before running the main algorithm. After that, we have to compute the so-called cross-correlation instead of convolution. In simple words, we move the kernel array along the input array, and compute the dot product between them for every possible offset.
std::reverse(f.data(), f.data() + 10);
for (int i = s; i < N; i++) {
    int b = i-10;
    float res = 0.0;
    for (int j = 0; j < 10; j++)
        res += f[j] * data[b+j];
    output[i] = res;
}
In order to vectorize it, let's compute 8 consecutive dot products at once. Recall that we can pack eight 32-bit float numbers into one 256-bit AVX register. We will vectorize the outer loop by i, which means that:
The loop by i will be advanced by 8 every iteration.
Every value inside the outer loop turns into an 8-element pack, such that the k-th element of the pack holds this value for the (i+k)-th iteration of the outer loop of the scalar version.
Here is the resulting code:
// reverse the kernel
__m256 revKernel[10];
for (size_t i = 0; i < 10; i++)
    revKernel[i] = _mm256_set1_ps(f[9-i]); // every component will have the same value
// note: you have to compute the last 16 values separately!
for (size_t i = s; i + 16 <= N; i += 8) {
    int b = i-10;
    __m256 res = _mm256_setzero_ps();
    for (size_t j = 0; j < 10; j++) {
        // load: data[b+j], data[b+j+1], data[b+j+2], ..., data[b+j+15]
        __m128i bytes = _mm_loadu_si128((__m128i*)&data[b+j]);
        // convert the first 8 bytes of the loaded 16-byte pack into 8 floats
        __m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes));
        // compute res = res + floats * revKernel[j] elementwise
        res = _mm256_fmadd_ps(revKernel[j], floats, res);
    }
    // store the 8 values packed in res into: output[i], output[i+1], ..., output[i+7]
    _mm256_storeu_ps(&output[i], res);
}
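The note about the last 16 values deserves a concrete shape. A sketch of the scalar epilogue (assuming f has already been reversed, as above):

// Scalar epilogue: the vectorized loop stops once i + 16 > N, so the last
// few outputs (at most about 23 of them) are finished with scalar code.
size_t tail = s;
while (tail + 16 <= (size_t)N) tail += 8; // first index the AVX loop did not write
for (size_t i = tail; i < (size_t)N; ++i) {
    float res = 0.0f;
    for (size_t j = 0; j < 10; ++j)
        res += f[j] * data[i - 10 + j]; // f holds the reversed kernel here
    output[i] = res;
}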
For 100 million elements, this code takes about 120 ms on my machine, while the original scalar implementation took 850 ms. Beware: I have a Ryzen 1600 CPU, so results on Intel CPUs may be somewhat different.
Now if you really want to unroll something, the inner loop by 10 kernel elements is the perfect place. Here is how it is done:
__m256 revKernel[10];
for (size_t i = 0; i < 10; i++)
    revKernel[i] = _mm256_set1_ps(f[9-i]);
for (size_t i = s; i + 16 <= N; i += 8) {
    size_t b = i-10;
    __m256 res = _mm256_setzero_ps();
    #define DOIT(j) { \
        __m128i bytes = _mm_loadu_si128((__m128i*)&data[b+j]); \
        __m256 floats = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(bytes)); \
        res = _mm256_fmadd_ps(revKernel[j], floats, res); \
    }
    DOIT(0);
    DOIT(1);
    DOIT(2);
    DOIT(3);
    DOIT(4);
    DOIT(5);
    DOIT(6);
    DOIT(7);
    DOIT(8);
    DOIT(9);
    _mm256_storeu_ps(&output[i], res);
}
It takes 110 ms on my machine (slightly better than the first vectorized version).
The simple copy of all elements (with conversion from bytes to floats) takes 40 ms for me, which means that this code is not memory-bound yet, and there is still some room for improvement left.

Discrete Fourier Transform implementation gives different result than OpenCV DFT

We have implemented the DFT and wanted to test it against OpenCV's implementation. The results are different:
Our DFT's results are ordered from smallest to biggest, whereas OpenCV's results are not in any particular order.
The first (0th) value is the same for both calculations, since in this case the complex part is 0 (e^0 = 1 in the formula). The other values are different; for example, OpenCV's results contain negative values, whereas ours do not.
This is our implementation of DFT:
// complex number
std::complex<float> j;
j = -1;
j = std::sqrt(j);

std::complex<float> result;
std::vector<std::complex<float>> fourier; // output

// this->N = length of contour, 512 in our case
// for each Fourier descriptor
for (int n = 0; n < this->N; ++n)
{
    // Summation in formula
    for (int t = 0; t < this->N; ++t)
    {
        result += (this->centroidDistance[t] * std::exp((-j * PI2 * ((float)n) * ((float)t)) / ((float)N)));
    }
    fourier.push_back((1.0f / this->N) * result);
}
and this is how we calculate the DFT with OpenCV:
std::vector<std::complex<float>> fourierCV; // output
cv::dft(std::vector<float>(centroidDistance, centroidDistance + this->N), fourierCV, cv::DFT_SCALE | cv::DFT_COMPLEX_OUTPUT);
The variable centroidDistance is calculated in a previous step.
Note: please avoid answers saying use OpenCV instead of your own implementation.
You forgot to initialise result for each iteration of n:
for (int n = 0; n < this->N; ++n)
{
    result = 0.0f; // initialise `result` to 0 here <<<
    // Summation in formula
    for (int t = 0; t < this->N; ++t)
    {
        result += (this->centroidDistance[t] * std::exp((-j * PI2 * ((float)n) * ((float)t)) / ((float)N)));
    }
    fourier.push_back((1.0f / this->N) * result);
}
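A slightly more robust variant, as a sketch, declares the accumulator inside the outer loop so a stale value can never leak between iterations:

for (int n = 0; n < this->N; ++n)
{
    std::complex<float> result = 0.0f; // fresh accumulator for every descriptor
    for (int t = 0; t < this->N; ++t)
    {
        result += this->centroidDistance[t] * std::exp((-j * PI2 * (float)n * (float)t) / (float)N);
    }
    fourier.push_back((1.0f / this->N) * result);
}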

How to access the RGB values in OpenCV?

I am confused about the use of number of channels.
Which one of the following is correct?
// roi is the image matrix
for (int i = 0; i < roi.rows; i++)
{
    for (int j = 0; j < roi.cols; j += roi.channels())
    {
        int b = roi.at<cv::Vec3b>(i, j)[0];
        int g = roi.at<cv::Vec3b>(i, j)[1];
        int r = roi.at<cv::Vec3b>(i, j)[2];
        cout << r << " " << g << " " << b << endl;
    }
}
Or,
for (int i = 0; i < roi.rows; i++)
{
    for (int j = 0; j < roi.cols; j++)
    {
        int b = roi.at<cv::Vec3b>(i, j)[0];
        int g = roi.at<cv::Vec3b>(i, j)[1];
        int r = roi.at<cv::Vec3b>(i, j)[2];
        cout << r << " " << g << " " << b << endl;
    }
}
The second one is correct.
The rows and cols of a Mat represent the number of pixels, while the number of channels has nothing to do with rows or cols.
OpenCV uses BGR order by default, so assuming the Mat has not been converted to RGB, the code is correct.
Reference: personal experience and the OpenCV docs.
A quicker way to get color components from an image is to have the image represented as an IplImage structure and then make use of the pixel size and number of channels to iterate through it using pointer arithmetic.
For example, if you know that your image is a 3-channel image with 1 byte per pixel and its format is BGR (the default in OpenCV), the following code will get access to its components:
(In the following code, img is of type IplImage.)
for (int y = 0; y < img->height; y++) {
    for (int x = 0; x < img->width; x++) {
        uchar blue  = ((uchar*)(img->imageData + img->widthStep*y))[x*3];
        uchar green = ((uchar*)(img->imageData + img->widthStep*y))[x*3 + 1];
        uchar red   = ((uchar*)(img->imageData + img->widthStep*y))[x*3 + 2];
    }
}
For a more flexible approach, you can use the CV_IMAGE_ELEM macro defined in types_c.h:
/* get reference to pixel at (col,row),
   for multi-channel images (col) should be multiplied by number of channels */
#define CV_IMAGE_ELEM( image, elemtype, row, col ) \
    (((elemtype*)((image)->imageData + (image)->widthStep*(row)))[(col)])
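For example (under the same 3-channel BGR assumption as above), the components at (x, y) can be read as:

uchar blue  = CV_IMAGE_ELEM(img, uchar, y, x*3);
uchar green = CV_IMAGE_ELEM(img, uchar, y, x*3 + 1);
uchar red   = CV_IMAGE_ELEM(img, uchar, y, x*3 + 2);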
I guess the 2nd one is correct; nevertheless, it is very time-consuming to get at the data like that.
A quicker method would be to use the IplImage* data structure and step a pointer through it by the size of the data contained in roi...
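IplImage is the legacy C API; the same fast pointer-based traversal can also be written against cv::Mat. A sketch, assuming roi is a CV_8UC3 BGR image as in the question:

for (int i = 0; i < roi.rows; ++i) {
    const cv::Vec3b* row = roi.ptr<cv::Vec3b>(i); // pointer to the start of row i
    for (int j = 0; j < roi.cols; ++j) {
        int b = row[j][0];
        int g = row[j][1];
        int r = row[j][2];
        std::cout << r << " " << g << " " << b << "\n";
    }
}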

Tiny numbers in place of zero?

I have been making a matrix class (as a learning exercise) and I have come across an issue while testing my inverse function.
I input an arbitrary matrix such as:
2 1 1
1 2 1
1 1 2
And got it to calculate the inverse and I got the correct result:
0.75 -0.25 -0.25
-0.25 0.75 -0.25
-0.25 -0.25 0.75
But when I tried multiplying the two together to make sure I got the identity matrix I get:
1 5.5111512e-017 0
0 1 0
-1.11022302e-016 0 1
Why am I getting these results? I would understand the rounding errors if I were multiplying awkward numbers, but the sum it's doing is:
2 * -0.25 + 1 * 0.75 + 1 * -0.25
which is clearly 0, not 5.5111512e-017.
If I manually get it to do the calculation, e.g.:
std::cout << (2 * -0.25 + 1 * 0.75 + 1 * -0.25) << "\n";
I get 0 as expected?
All the numbers are represented as doubles.
Here's my multiplication overload:
Matrix operator*(const Matrix& A, const Matrix& B)
{
    if (A.get_cols() == B.get_rows())
    {
        Matrix temp(A.get_rows(), B.get_cols());
        for (unsigned m = 0; m < temp.get_rows(); ++m)
        {
            for (unsigned n = 0; n < temp.get_cols(); ++n)
            {
                for (unsigned i = 0; i < A.get_cols(); ++i) // inner dimension: A's cols == B's rows
                {
                    temp(m, n) += A(m, i) * B(i, n);
                }
            }
        }
        return temp;
    }
    throw std::runtime_error("Bad Matrix Multiplication");
}
and the access functions:
double& Matrix::operator()(unsigned r, unsigned c)
{
    return data[cols * r + c];
}

double Matrix::operator()(unsigned r, unsigned c) const
{
    return data[cols * r + c];
}
Here's the function to find the inverse:
Matrix Inverse(Matrix& M)
{
    if (M.rows != M.cols)
    {
        throw std::runtime_error("Matrix is not square");
    }
    int r = 0;
    int c = 0;
    Matrix augment(M.rows, M.cols * 2);
    augment.copy(M);
    for (r = 0; r < M.rows; ++r)
    {
        for (c = M.cols; c < M.cols * 2; ++c)
        {
            augment(r, c) = (r == (c - M.cols) ? 1.0 : 0.0);
        }
    }
    for (int R = 0; R < augment.rows; ++R)
    {
        double n = augment(R, R);
        for (c = 0; c < augment.cols; ++c)
        {
            augment(R, c) /= n;
        }
        for (r = 0; r < augment.rows; ++r)
        {
            if (r == R) { continue; }
            double a = augment(r, R);
            for (c = 0; c < augment.cols; ++c)
            {
                augment(r, c) -= a * augment(R, c);
            }
        }
    }
    Matrix inverse(M.rows, M.cols);
    for (r = 0; r < M.rows; ++r)
    {
        for (c = M.cols; c < M.cols * 2; ++c)
        {
            inverse(r, c - M.cols) = augment(r, c);
        }
    }
    return inverse;
}
Please read this paper: What Every Computer Scientist Should Know About Floating-Point Arithmetic
You've got numbers like 0.250000000000000005 in your inverted matrix; they're just rounded for display, so you see nice little round numbers like 0.25.
You shouldn't have any problems with these numbers, since with this particular matrix the inverse is all powers of 2 and can be represented exactly. In general, operations on floating point numbers introduce small errors that may accumulate, and the results may be surprising.
In your case, I'm pretty sure the inverse is inaccurate and you're just displaying the first few digits. I.e., it isn't exactly 0.25 (=1/4), 0.75 (=3/4), etc.
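One way to check, as a sketch: print an entry with more significant digits than the default six (using the operator() accessor from the question):

#include <iomanip>
#include <iostream>

// 17 significant digits are enough to distinguish any two distinct doubles,
// so any deviation from exactly 0.75 becomes visible.
std::cout << std::setprecision(17) << inverse(0, 0) << "\n";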
You're always going to run into floating point rounding errors like this, especially when working with numbers that do not have exact binary representations (i.e., your numbers are not equal to 2^(N) or 1/(2^N), where N is some integer value).
That being said, there are a number of ways to increase the precision of your results, and you may want to do a Google search for numerically stable Gaussian elimination algorithms using fixed-precision floating point values; a sketch of the usual first step follows.
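For reference, that usual first step is partial pivoting: before eliminating column R, swap in the row whose entry in that column has the largest absolute value, so you never divide by a tiny pivot. A hedged sketch against the Matrix class from the question (swap_rows is a hypothetical helper the class would need to provide):

// Partial pivoting sketch: pick the row with the largest |entry| in column R
// before the division by augment(R, R).
int pivot = R;
for (int r = R + 1; r < augment.rows; ++r)
    if (std::abs(augment(r, R)) > std::abs(augment(pivot, R)))
        pivot = r;
if (pivot != R)
    augment.swap_rows(pivot, R); // hypothetical helper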
You can also, if you are willing to take a speed hit, incorporate an infinite-precision math library that uses rational numbers; if you take that route, just avoid the use of roots, which can create irrational numbers. There are a number of libraries out there that can help you with rational numbers, such as GMP. You can also write a rational class yourself, although beware: it's relatively easy to overflow the results of multiple math operations if you are only using unsigned 64-bit values along with an extra sign-flag variable for the components of your rational numbers. That's where GMP, with its unlimited-length integers, comes in handy.
It's just simple floating point error. Even doubles on computers aren't 100% accurate. Most base-10 decimal fractions simply cannot be represented exactly in binary with a finite number of bits.
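A classic self-contained demonstration of the underlying arithmetic (note that the stray value has the same order of magnitude as the ones in your product matrix):

#include <cstdio>

int main() {
    double a = 0.1 + 0.2;            // neither 0.1 nor 0.2 is exact in binary
    std::printf("%.17g\n", a);       // prints 0.30000000000000004
    std::printf("%.17g\n", a - 0.3); // prints ~5.55e-17, not 0
    return 0;
}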