Best way to indexing a matrix in opencv - c++

Let say, A and B are matrices of the same size.
In Matlab, I could use simple indexing as below.
idx = A>0;
B(idx) = 0
How can I do this in OpenCV? Should I just use
for (i=0; ... rows)
for(j=0; ... cols)
if (A.at<double>(i,j)>0) B.at<double>(i,j) = 0;
something like this? Is there a better (faster and more efficient) way?
Moreover, in OpenCV, when I try
Mat idx = A>0;
the variable idx seems to be a CV_8U matrix (not boolean but integer).

You can easily convert this MATLAB code:
idx = A > 0;
B(idx) = 0;
// same as
B(A>0) = 0;
to OpenCV as:
Mat1d A(...)
Mat1d B(...)
Mat1b idx = A > 0;
B.setTo(0, idx) = 0;
// or
B.setTo(0, A > 0);
Regarding performance, in C++ it's usually faster (it depends on the enabled optimizations) to work on raw pointers (but is less readable):
for (int r = 0; r < B.rows; ++r)
{
double* pA = A.ptr<double>(r);
double* pB = B.ptr<double>(r);
for (int c = 0; c < B.cols; ++c)
{
if (pA[c] > 0.0) pB[c] = 0.0;
}
}
Also note that in OpenCV there isn't any boolean matrix, but it's a CV_8UC1 matrix (aka a single channel matrix of unsigned char), where 0 means false, and any value >0 is true (typically 255).
Evaluation
Note that this may vary according to optimization enabled with OpenCV. You can test the code below on your PC to get accurate results.
Time in ms:
my results my results #AdrienDescamps
(OpenCV 3.0 No IPP) (OpenCV 2.4.9)
Matlab : 13.473
C++ Mask: 640.824 5.81815 ~5
C++ Loop: 5.24414 4.95127 ~4
Note: I'm not entirely sure about the performance drop with OpenCV 3.0, so I just remark: test the code below on your PC to get accurate results.
As #AdrienDescamps stated in comments:
It seems that the performance drop with OpenCV 3.0 is related to the OpenCL option, that is now enabled in the comparison operator.
C++ Code
#include <opencv2/opencv.hpp>
#include <iostream>
using namespace std;
using namespace cv;
int main()
{
// Random initialize A with values in [-100, 100]
Mat1d A(1000, 1000);
randu(A, Scalar(-100), Scalar(100));
// B initialized with some constant (5) value
Mat1d B(A.rows, A.cols, 5.0);
// Operation: B(A>0) = 0;
{
// Using mask
double tic = double(getTickCount());
B.setTo(0, A > 0);
double toc = (double(getTickCount()) - tic) * 1000 / getTickFrequency();
cout << "Mask: " << toc << endl;
}
{
// Using for loop
double tic = double(getTickCount());
for (int r = 0; r < B.rows; ++r)
{
double* pA = A.ptr<double>(r);
double* pB = B.ptr<double>(r);
for (int c = 0; c < B.cols; ++c)
{
if (pA[c] > 0.0) pB[c] = 0.0;
}
}
double toc = (double(getTickCount()) - tic) * 1000 / getTickFrequency();
cout << "Loop: " << toc << endl;
}
getchar();
return 0;
}
Matlab Code
% Random initialize A with values in [-100, 100]
A = (rand(1000) * 200) - 100;
% B initialized with some constant (5) value
B = ones(1000) * 5;
tic
B(A>0) = 0;
toc
UPDATE
OpenCV 3.0 uses IPP optimization in the function setTo. If you have that enabled (you can check with cv::getBuildInformation()), you'll have a faster computation.

The answer of Miki is very good, but i just want to add some clarification about the performance problem to avoid any confusion.
It is true that the best way to implement an image filter (or any algorithm) with OpenCV is to use the raw pointers, as shown in the second C++ example of Miki (C++ Loop).
Using the at function is also correct, but significantly slower.
However, most of the time, you don't need to worry about that, and you can simply use the high level functions of OpenCV (first example of Miki , C++ Mask). They are well optimized, and will usually be almost as fast as a low level loop on pointers, or even faster.
Of course, there are exceptions (we just found one), and you should always test for your specific problem.
Now, regarding this specific problem :
The example here where the high level function was much slower (100x slower) than the low level loop is NOT a normal case, as it is demonstrated by the timings with other version/configuration of OpenCV, that are much lower.
The problem seems to be that when OpenCV3.0 is compiled with OpenCL, there is a huge overhead the first time a function that uses OpenCL is called. The simplest solution is to disable OpenCL at compile time, if you use OpenCV3.0 (see also here for other possible solutions if you are interested).

Related

Opencv obatin certain pixel RGB value based on mask

My title may not be clear enough, but please look carefully on the following description.Thanks in advance.
I have a RGB image and a binary mask image:
Mat img = imread("test.jpg")
Mat mask = Mat::zeros(img.rows, img.cols, CV_8U);
Give some ones to the mask, assume the number of ones is N. Now the nonzero coordinates are known, based on these coordinates, we can surely obtain the corresponding pixel RGB value of the origin image.I know this can be accomplished by the following code:
Mat colors = Mat::zeros(N, 3, CV_8U);
int counter = 0;
for (int i = 0; i < mask.rows; i++)
{
for (int j = 0; j < mask.cols; j++)
{
if (mask.at<uchar>(i, j) == 1)
{
colors.at<uchar>(counter, 0) = img.at<Vec3b>(i, j)[0];
colors.at<uchar>(counter, 1) = img.at<Vec3b>(i, j)[1];
colors.at<uchar>(counter, 2) = img.at<Vec3b>(i, j)[2];
counter++;
}
}
}
And the coords will be as follows:
enter image description here
However, this two layer of for loop costs too much time. I was wondering if there is a faster method to obatin colors, hope you guys can understand what I was trying to convey.
PS:If I can use python, this can be done in only one sentence:
colors = img[mask == 1]
The .at() method is the slowest way to access Mat values in C++. Fastest is to use pointers, but best practice is an iterator. See the OpenCV tutorial on scanning images.
Just a note, even though Python's syntax is nice for something like this, it still has to loop through all of the elements at the end of the day---and since it has some overhead before this, it's de-facto slower than C++ loops with pointers. You necessarily need to loop through all the elements regardless of your library, you're doing comparisons with the mask for every element.
If you are flexible with using any other open source library using C++, try Armadillo. You can do all linear algebra operations with it and also, you can reduce above code to one line(similar to your Python code snippet).
Or
Try findNonZero()function and find all coordinates in image containing non-zero values. Check this: https://stackoverflow.com/a/19244484/7514664
Compile with optimization enabled, try profiling this version and tell us if it is faster:
vector<Vec3b> colors;
if (img.isContinuous() && mask.isContinuous()) {
auto pimg = img.ptr<Vec3b>();
for (auto pmask = mask.datastart; pmask < mask.dataend; ++pmask, ++pimg) {
if (*pmask)
colors.emplace_back(*pimg);
}
}
else {
for (int r = 0; r < img.rows; ++r) {
auto prowimg = img.ptr<Vec3b>(r);
auto prowmask = img.ptr(r);
for (int c = 0; c < img.cols; ++c) {
if (prowmask[c])
colors.emplace_back(prowimg[c]);
}
}
}
If you know the size of colors, reserve the space for it beforehand.

How to use cv::parallel_for_ for execution time reduction

I created an image processing algorithm using OpenCV and currently I'm trying to improve the time efficiency of my own, simple function which is similar to LUT, but with interpolation between values (double calibRI::corr(double)).
I optimized the pixel loop according to the OpenCV docs.
Non parallel function (calib(cv::Mat) -an object of calibRI functor class) takes about 0.15s. I decided to use cv::parallel_for_ to make it shorter.
First I implemented it as image tiling -according to >> this document. The time was reduced to 0.12s (4 threads).
virtual void operator()(const cv::Range& range) const
{
for(int i = range.start; i < range.end; i++)
{
// divide image in 'thr' number of parts and process simultaneously
cv::Rect roi(0, (img.rows/thr)*i, img.cols, img.rows/thr);
cv::Mat in = img(roi);
cv::Mat out = retVal(roi);
out = calib(in); //loops over all pixels and does out[u,v]=calibRI::corr(in[u,v])
}
I though that running my function in parallel for subimages/tiles/ROIs is not yet optimal, so I implemented it as below:
template <typename T>
class ParallelPixelLoop : public cv::ParallelLoopBody
{
typedef boost::function<T(T)> pixelProcessingFuntionPtr;
private:
cv::Mat& image; //source and result image (to be overwritten)
bool cont; //if the image is continuous
size_t rows;
size_t cols;
size_t threads;
std::vector<cv::Range> ranges;
pixelProcessingFuntionPtr pixelProcessingFunction; //pixel modif. function
public:
ParallelPixelLoop(cv::Mat& img, pixelProcessingFuntionPtr fun, size_t thr = 4)
: image(img), cont(image.isContinuous()), rows(img.rows), cols(img.cols), pixelProcessingFunction(fun), threads(thr)
{
int groupSize = 1;
if (cont) {
cols *= rows;
rows = 1;
groupSize = ceil( cols / threads );
}
else {
groupSize = ceil( rows / threads );
}
int t = 0;
for(t=0; t<threads-1; ++t) {
ranges.push_back( cv::Range( t*groupSize, (t+1)*groupSize ) );
}
ranges.push_back( cv::Range( t*groupSize, rows<=1?cols:rows ) ); //last range must be to the end of image (ceil used before)
}
virtual void operator()(const cv::Range& range) const
{
for(int r = range.start; r < range.end; r++)
{
T* Ip = nullptr;
cv::Range ran = ranges.at(r);
if(cont) {
Ip = image.ptr<T>(0);
for (int j = ran.start; j < ran.end; ++j)
{
Ip[j] = pixelProcessingFunction(Ip[j]);
}
}
else {
for(int i = ran.start; i < ran.end; ++i)
{
Ip = image.ptr<T>(i);
for (int j = 0; j < cols; ++j)
{
Ip[j] = pixelProcessingFunction(Ip[j]);
}
}
}
}
}
};
Then I run it on 1280x1024 64FC1 image, on i5 processor, Win8, and get the time in range of 0.4s using the code below:
double t = cv::getTickCount();
ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
cv::parallel_for_(cv::Range(0,4),loop);
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
I have no idea why is my implementation so much slower than iterating all the pixels in subimages... Is there a bug in my code or the OpenCV ROIs are optimized in some special way?
I do not think there is a time measurement error issue, as described here. I'm using OpenCV time functions.
Is there any other way to reduce the time of this function?
Thanks in advance!
Generally it's really hard to say why using cv::parallel_for failed to speed up whole process. One possibility is that the problem is not related to processing/multithreading, but to time measurement. About 2 months ago i tried to optimize this algorithm and i noticed strange thing - first time i use it, it takes x ms, but if use use it second, third, ... time (of course without restarting application) it takes about x/2 (or even x/3) ms. I'm not sure what causes this behaviour - most likely (in my opinion) it's causes by branch prediction - when code is executed first time branch predictor "learns" which paths are usually taken, so next time it can predict which branch to take(and usually the guess will be correct). You can read more about it here - it's really good question and it can open your eyes for some quite important thing.
So, in your situation i would try few things:
measure it many times - 100 or 1000 should be enough (if it takes 0.12-0.4s it won't take much time) and see whether the last version of you code still is the slowest one. So just replace your code with this:
double t = cv::getTickCount();
for (unsigned int i=0; i<1000; i++) {
ParallelPixelLoop loop(V,boost::bind(&calibRI::corr,this,_1),4);
cv::parallel_for_(cv::Range(0,4),loop);
}
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
test it on bigger image. Maybe in your situation you just "don't need" 4 cores, but on bigger image 4 cores will make positive difference.
Use profiler (for example Very Sleepy) to see what part of your code is critical

Why can't I get a working 2-D FFT under Visual Studio 2013 using FFTW or AMPFFT?

I've been working with 2D FFTs in a project of mine, and have been unable to get correct results using two different FFT libraries. At first I assumed I was using them wrong, but upon comparing against MATLAB and Linux GCC reference implementations, it now seems something sinister is going on with my compiler (MSVC 2013 express).
My test case is as follows:
256x256 complex to real IFFT, with a single bin at 255 (0,255 for X,Y notation) set to 10000.
Using AMPFFT, I get the following 2D transform:
And with FFTW, I get the following 2D transform:
As you can see, the AMPFFT version is sort of "almost" correct, but has this weird, every sample banding in it, and the FFTW version is just all over the place and out to lunch.
I took the output of the two different test versions and compared them to MATLAB (technically octave, which uses FFTW under the hood). I also ran the same test case for FFTW under Linux with GCC. Here is a slice from that set of tests of the 127th row (the row number technically doesn't matter since with my choice of bins all rows should be identical):
In this example, the octave and Linux implementations represent the correct result and follow the red line (octave was plotted black, the Linux in red, and it agreed completely with the octave). The FFTW under MSVC is plotted blue, and the AMP FFT output is plotted in magenta. As you can see, again the AMPFFT version seems almost close, but has this weird high frequency ripple in it, and the FFTW under MSVC is just a mess, with this weird "packeted" look to it.
At this stage I can only point the finger at Visual Studio, but I have no idea what's going on or how to fix it.
Here are my two test programs:
FFTW test:
//fftwtest.cpp
//2 dimensional complex-to-real inverse FFT test.
//Produces a 256 x 256 real-valued matrix that is loadable by octave/MATLAB
#include <fstream>
#include <iostream>
#include <complex>
#include <fftw3.h>
int main(int argc, char** argv)
{
int FFTSIZE = 256;
std::complex<double>* cpxArray;
std::complex<double>* fftOut;
// cpxArray = new std::complex<double>[FFTSIZE * FFTSIZE];
//fftOut = new double[FFTSIZE * FFTSIZE];
fftOut = (std::complex<double>*)fftw_alloc_complex(FFTSIZE*FFTSIZE);
cpxArray = (std::complex<double>*)fftw_alloc_complex(FFTSIZE * FFTSIZE);
for(int i = 0; i < FFTSIZE * FFTSIZE; i++) cpxArray[i] = 0;
cpxArray[255] = std::complex<double>(10000, 0);
fftw_plan p = fftw_plan_dft_2d(FFTSIZE, FFTSIZE, (fftw_complex*)cpxArray, (fftw_complex*)fftOut, FFTW_BACKWARD, FFTW_DESTROY_INPUT | FFTW_ESTIMATE);
fftw_execute(p);
std::ofstream debugDump("debugdumpFFTW.txt");
for(int j = 0; j < FFTSIZE; j++)
{
for(int i = 0; i < FFTSIZE; i++)
{
debugDump << " " << fftOut[j * FFTSIZE + i].real();
}
debugDump << std::endl;
}
debugDump.close();
}
AMPFFT test:
//ampffttest.cpp
//2 dimensional complex-to-real inverse FFT test.
//Produces a 256 x 256 real-valued matrix that is loadable by octave/MATLAB
#include <amp_fft.h>
#include <fstream>
#include <iostream>
int main(int argc, char** argv)
{
int FFTSIZE = 256;
std::complex<float>* cpxArray;
float* fftOut;
cpxArray = new std::complex<float>[FFTSIZE * FFTSIZE];
fftOut = new float[FFTSIZE * FFTSIZE];
for(size_t i = 0; i < FFTSIZE * FFTSIZE; i++) cpxArray[i] = 0;
cpxArray[255] = std::complex<float>(10000, 0);
concurrency::extent<2> e(FFTSIZE, FFTSIZE);
std::cout << "E[0]: " << e[0] << " E[1]: " << e[1] << std::endl;
fft<float, 2> m_fft(e);
concurrency::array<float, 2> outpArray(concurrency::extent<2>(FFTSIZE, FFTSIZE));
concurrency::array<std::complex<float>, 2> inpArray(concurrency::extent<2>(FFTSIZE, FFTSIZE), cpxArray);
m_fft.inverse_transform(inpArray, outpArray);
std::vector<float> outVec = outpArray;
std::copy(outVec.begin(), outVec.end(), fftOut);
std::ofstream debugDump("debugdump.txt");
for(int j = 0; j < FFTSIZE; j++)
{
for(int i = 0; i < FFTSIZE; i++)
{
debugDump << " " << fftOut[j * FFTSIZE + i];
}
debugDump << std::endl;
}
}
Both of these were compiled with stock settings on MSVC 2013 for a win32 console app, and the FFTW test was also run under Centos 6.4 with GCC 4.4.7. Both FFTW tests use FFTW version 3.3.4, and as an aside both complex to real and complex to complex plans were tested (with identical results).
Does anyone have even the slightest clue on Visual Studio compiler settings I could try to fix this problem?
Looking at the MSVC FFTW output in blue, it appears to be a number of sines multiplied. That is to say, there's a sine with period 64 or so, and a sine with period 4, and possibly yet another one with a similar frequency to produce a beat.
That basically means the MSVC version had at least two non-zero inputs. I suspect the cause is the typecasting, as you write a fftw_complex object through a std::complex<double> type.

Implementing FFT low-pass filter in C with FFTW

I am trying to create a very simple C++ program that given an argument in range [0-100] applies a low-pass filter to a grayscale image that should "compress" it proprotionally to the value of the given argument.
I am using the FFTW library.
I have some doubts about how I define the frequency threshold, cut. Is there any more effective way to define such value?
//fftw_complex *fft
//double[] magnitude
// . . .
int percent = 100;
if (percent < 0 || percent > 100) {
cerr << "Compression rate must be a value between 0 and 100." << endl;
return -1;
}
double cut =(double)(w*h) * ((double)percent / (double)100);
for (i = 0; i < (w * h); i++) {
magnitude[i] = sqrt(pow(fft[i][0], 2.0) + pow(fft[i][1], 2.0));
if (magnitude[i] < cut) {
fft[i][0] = 0.0;
fft[i][1] = 0.0;
}
}
Update1:
I've changed my code to this, but again I'm not sure this is a proper way to filter frequencies. The image is surely compressed, but non-square images are messed up and setting compression to 100% isn't the real maximum compression available (I can go up to ~140%).
Here you can find an image of what I see now.
int cX = w/2;
int cY = h/2;
cout<<"TEST "<<((double)percent/(double)100)*h<<endl;
for(i = 0; i<(w*h);i++){
int row = i/s;
int col = i%s;
int distance = sqrt((col-cX)*(col-cX)+(row-cY)*(row-cY));
if(distance<((double)percent/(double)100)*min(cX,cY)){
fft[i][0] = 0.0;
fft[i][1] = 0.0;
}
}
This is not a low-pass filter at all. A low-pass filter passes low frequencies, i.e. it removes fine details (blurring). You obviously need a 2D FFT for that.
This code just removes random bits, essentially.
[edit]
The new code looks a lot more like a low-pass filter. The 141% setting is expected: the diagonal of a square is sqrt(2)=1.41 times its side. Converting an index into a row/column pair should use the image width, not some random unexplained s.
I don't know where your zero frequency is located. That should be easy to spot (largest value) but it might be in (0,0) instead of (w/2,h/2)

C++ eigenvalue/vector decomposition, only need first n vectors fast

I have a ~3000x3000 covariance-alike matrix on which I compute the eigenvalue-eigenvector decomposition (it's a OpenCV matrix, and I use cv::eigen() to get the job done).
However, I actually only need the, say, first 30 eigenvalues/vectors, I don't care about the rest. Theoretically, this should allow to speed up the computation significantly, right? I mean, that means it has 2970 eigenvectors less that need to be computed.
Which C++ library will allow me to do that? Please note that OpenCV's eigen() method does have the parameters for that, but the documentation says they are ignored, and I tested it myself, they are indeed ignored :D
UPDATE:
I managed to do it with ARPACK. I managed to compile it for windows, and even to use it. The results look promising, an illustration can be seen in this toy example:
#include "ardsmat.h"
#include "ardssym.h"
int n = 3; // Dimension of the problem.
double* EigVal = NULL; // Eigenvalues.
double* EigVec = NULL; // Eigenvectors stored sequentially.
int lowerHalfElementCount = (n*n+n) / 2;
//whole matrix:
/*
2 3 8
3 9 -7
8 -7 19
*/
double* lower = new double[lowerHalfElementCount]; //lower half of the matrix
//to be filled with COLUMN major (i.e. one column after the other, always starting from the diagonal element)
lower[0] = 2; lower[1] = 3; lower[2] = 8; lower[3] = 9; lower[4] = -7; lower[5] = 19;
//params: dimensions (i.e. width/height), array with values of the lower or upper half (sequentially, row major), 'L' or 'U' for upper or lower
ARdsSymMatrix<double> mat(n, lower, 'L');
// Defining the eigenvalue problem.
int noOfEigVecValues = 2;
//int maxIterations = 50000000;
//ARluSymStdEig<double> dprob(noOfEigVecValues, mat, "LM", 0, 0.5, maxIterations);
ARluSymStdEig<double> dprob(noOfEigVecValues, mat);
// Finding eigenvalues and eigenvectors.
int converged = dprob.EigenValVectors(EigVec, EigVal);
for (int eigValIdx = 0; eigValIdx < noOfEigVecValues; eigValIdx++) {
std::cout << "Eigenvalue: " << EigVal[eigValIdx] << "\nEigenvector: ";
for (int i = 0; i < n; i++) {
int idx = n*eigValIdx+i;
std::cout << EigVec[idx] << " ";
}
std::cout << std::endl;
}
The results are:
9.4298, 24.24059
for the eigenvalues, and
-0.523207, -0.83446237, -0.17299346
0.273269, -0.356554, 0.893416
for the 2 eigenvectors respectively (one eigenvector per row)
The code fails to find 3 eigenvectors (it can only find 1-2 in this case, an assert() makes sure of that, but well, that's not a problem).
In this article, Simon Funk shows a simple, effective way to estimate a singular value decomposition (SVD) of a very large matrix. In his case, the matrix is sparse, with dimensions: 17,000 x 500,000.
Now, looking here, describes how eigenvalue decomposition closely related to SVD. Thus, you might benefit from considering a modified version of Simon Funk's approach, especially if your matrix is sparse. Furthermore, your matrix is not only square but also symmetric (if that is what you mean by covariance-like), which likely leads to additional simplification.
... Just an idea :)
It seems that Spectra will do the job with good performances.
Here is an example from their documentation to compute the 3 first eigen values of a dense symmetric matrix M (likewise your covariance matrix):
#include <Eigen/Core>
#include <Spectra/SymEigsSolver.h>
// <Spectra/MatOp/DenseSymMatProd.h> is implicitly included
#include <iostream>
using namespace Spectra;
int main()
{
// We are going to calculate the eigenvalues of M
Eigen::MatrixXd A = Eigen::MatrixXd::Random(10, 10);
Eigen::MatrixXd M = A + A.transpose();
// Construct matrix operation object using the wrapper class DenseSymMatProd
DenseSymMatProd<double> op(M);
// Construct eigen solver object, requesting the largest three eigenvalues
SymEigsSolver< double, LARGEST_ALGE, DenseSymMatProd<double> > eigs(&op, 3, 6);
// Initialize and compute
eigs.init();
int nconv = eigs.compute();
// Retrieve results
Eigen::VectorXd evalues;
if(eigs.info() == SUCCESSFUL)
evalues = eigs.eigenvalues();
std::cout << "Eigenvalues found:\n" << evalues << std::endl;
return 0;
}