Anybody know about the fastest method for calculating convolution? Unfortunately the matrix which I deal with is very large (500x500x200) and if I use convn in MATLAB it takes a long time (I have to iterate this calculation in a nested loop). So, I used convolution with FFT and it is faster now. But, I am still looking for a faster method. Any idea?

If your kernel is separable, the greatest speed gains will be realized by performing multiple sequential 1D convolutions.
Steve Eddins of MathWorks describes how to take advantage of the associativity of convolution to speed up convolution when the kernel is separable in a MATLAB context on his blog. For a P-by-Q kernel, the computational advantage of performing two separate and sequential convolutions vs. 2D convolution is PQ/(P+Q), which corresponds to 4.5x for a 9x9 kernel and ~11x for a 15x15 kernel. EDIT: An interesting unwitting demonstration of this difference was given in this Q&A.
To check whether the kernel is separable (i.e. whether it is the outer product of two vectors), the blog goes on to describe how to test for separability with the SVD and how to recover the 1D kernels. The example there is for a 2D kernel. For N-dimensional separable convolution, check this FEX submission.
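To make the PQ/(P+Q) argument concrete, here is a minimal C++ sketch (standard library only) that compares a full 2D convolution with two sequential 1D passes. The names conv2_full, conv2_separable and outer are made up for this illustration; they are not from the blog post or from MATLAB.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<double>>;

// Full 2D convolution with an arbitrary P-by-Q kernel: P*Q multiplies per pixel.
Image conv2_full(const Image& img, const Image& k) {
    assert(!img.empty() && !k.empty());
    const std::size_t H = img.size(), W = img[0].size();
    const std::size_t P = k.size(), Q = k[0].size();
    Image out(H + P - 1, std::vector<double>(W + Q - 1, 0.0));
    for (std::size_t y = 0; y < H; ++y)
        for (std::size_t x = 0; x < W; ++x)
            for (std::size_t p = 0; p < P; ++p)
                for (std::size_t q = 0; q < Q; ++q)
                    out[y + p][x + q] += img[y][x] * k[p][q];
    return out;
}

// Separable version: one 1D pass down the columns, then one along the rows,
// i.e. roughly P + Q multiplies per pixel.
Image conv2_separable(const Image& img, const std::vector<double>& col,
                      const std::vector<double>& row) {
    assert(!img.empty() && !col.empty() && !row.empty());
    const std::size_t H = img.size(), W = img[0].size();
    const std::size_t P = col.size(), Q = row.size();
    Image tmp(H + P - 1, std::vector<double>(W, 0.0));
    for (std::size_t y = 0; y < H; ++y)
        for (std::size_t x = 0; x < W; ++x)
            for (std::size_t p = 0; p < P; ++p)
                tmp[y + p][x] += img[y][x] * col[p];
    Image out(H + P - 1, std::vector<double>(W + Q - 1, 0.0));
    for (std::size_t y = 0; y < tmp.size(); ++y)
        for (std::size_t x = 0; x < W; ++x)
            for (std::size_t q = 0; q < Q; ++q)
                out[y][x + q] += tmp[y][x] * row[q];
    return out;
}

// The 2D kernel equivalent to (col, row) is their outer product.
Image outer(const std::vector<double>& col, const std::vector<double>& row) {
    Image k(col.size(), std::vector<double>(row.size()));
    for (std::size_t p = 0; p < col.size(); ++p)
        for (std::size_t q = 0; q < row.size(); ++q)
            k[p][q] = col[p] * row[q];
    return k;
}
```

Both functions return the "full" output size, so for a separable kernel conv2_full(img, outer(col, row)) and conv2_separable(img, col, row) should agree up to rounding.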
Another resource worth pointing out is this SIMD (SSE3/SSE4) implementation of 3D convolution by Intel, which includes both source and a presentation. The code is for 16 bit integers. Unless you move to GPU (e.g. cuFFT), it is probably hard to get faster than Intel's implementations, which also include Intel MKL. There is an example of 3D convolution (single-precision float) at the bottom of this page of the MKL documentation (link fixed, now mirrored in

You could try the overlap-add and overlap-save methods. They involve breaking up your input signal into smaller chunks and then using either of the above methods.
An FFT is most likely - and I could be wrong - the fastest method, especially if you are using built-in routines in MATLAB or a library in C++. Apart from that, breaking the input signal into smaller chunks should be a good bet.
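As a rough illustration of overlap-add, here is a minimal C++ sketch for the 1D case. It uses a direct convolution for each block to stay self-contained; in practice each block would go through an FFT-based convolution. The function names and the block parameter are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Direct linear convolution, output length x.size() + h.size() - 1.
std::vector<double> conv_direct(const std::vector<double>& x,
                                const std::vector<double>& h) {
    assert(!x.empty() && !h.empty());
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < h.size(); ++j)
            y[i + j] += x[i] * h[j];
    return y;
}

// Overlap-add: split x into blocks of `block` samples, convolve each block
// with h, and add the partial results where their tails overlap.
std::vector<double> conv_overlap_add(const std::vector<double>& x,
                                     const std::vector<double>& h,
                                     std::size_t block) {
    assert(block > 0 && !x.empty() && !h.empty());
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t start = 0; start < x.size(); start += block) {
        const std::size_t len = std::min(block, x.size() - start);
        const auto first = x.begin() + static_cast<std::ptrdiff_t>(start);
        const std::vector<double> chunk(first, first + static_cast<std::ptrdiff_t>(len));
        // Stand-in for an FFT-based convolution of this block.
        const std::vector<double> part = conv_direct(chunk, h);
        for (std::size_t i = 0; i < part.size(); ++i)
            y[start + i] += part[i];
    }
    return y;
}
```

The block size only changes how the work is split, not the result, so it can be tuned to whatever FFT length is fastest on the machine at hand.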

I have two ways to compute a fast convolution,
and the second is better than the first.
1- Armadillo
You can use the Armadillo library to compute the convolution with this code:
cx_vec signal(1024,fill::randn);
cx_vec code(300,fill::randn);
cx_vec ans = conv(signal,code);
2- Use the FFTW, SigPack and Armadillo libraries to compute a fast convolution. With this approach the FFT of your code (the filter) is computed once and then reused:
FastConvolution::FastConvolution(cx_vec inpCode)
{
    filterCode = inpCode;
    fft_w = NULL;
}
cx_vec FastConvolution::filter(cx_vec inpData)
{
    int length = inpData.size() + filterCode.size();
    // Round up to the next power of two when length is not one already
    // (the posted test "== 0" only fired for exact powers of two).
    if ((length & (length - 1)) != 0)
        length = pow(2, (int)log2(length) + 1);
    if (length != fftCode.size())
        initCode(length); // assumed: the body of this if was lost from the post
    static cx_vec zeroPadedData;
    if (length != zeroPadedData.size())
        zeroPadedData.zeros(length); // assumed: the body of this if was lost from the post
    zeroPadedData.subvec(0, inpData.size() - 1) = inpData;
    cx_vec fftSignal = fft_w->fft_cx(zeroPadedData);
    cx_vec mullAns = fftSignal % fftCode;
    cx_vec ans = fft_w->ifft_cx(mullAns);
    return ans.subvec(filterCode.size(), inpData.size() + filterCode.size() - 1);
}
void FastConvolution::initCode(int length)
{
    if (fft_w != NULL)
        delete fft_w;
    fft_w = new sp::FFTW(length, FFTW_ESTIMATE);
    cx_vec conjCode(length, fill::zeros);
    for (int i = 0; i < filterCode.size(); i++)
    {
        // assumed: both sides of this assignment were lost from the post
        conjCode.at(i) = filterCode.at(filterCode.size() - i - 1);
    }
    conjCode = conj(conjCode);
    fftCode = fft_w->fft_cx(conjCode);
}
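The padding step above is easy to get wrong, so here is a small standard-library sketch of it on its own. The helper names next_pow2 and fft_length are assumptions for this illustration and are not part of Armadillo, SigPack or FFTW.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is >= n (for n >= 1).
std::size_t next_pow2(std::size_t n) {
    assert(n >= 1);
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// FFT length that avoids circular wrap-around when convolving a signal of
// n_signal samples with a filter of n_filter taps: the linear convolution
// has n_signal + n_filter - 1 samples, rounded up to a power of two.
std::size_t fft_length(std::size_t n_signal, std::size_t n_filter) {
    assert(n_signal >= 1 && n_filter >= 1);
    return next_pow2(n_signal + n_filter - 1);
}
```

For the 1024-sample signal and 300-tap code from the first example this gives an FFT length of 2048. A length that is already a power of two is returned unchanged.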


How Can I make it faster in c++11 with std::vector?

I have cv::Mat Mat_A and cv::Mat Mat_B; both are (800000 x 512) floats,
and the code below looks slow.
int rows = Mat_B.rows;
cv::Mat Mat_A = cv::repeat(img, rows, 1);
Mat_A = Mat_A - Mat_B;
cv::reduce(Mat_A, Mat_A, 1, CV_REDUCE_SUM);
cv::minMaxLoc(Mat_A, &dis, 0, &point, 0);
How can I do this with std::vector?
I think it should be faster.
On my 2.4 GHz MacBook Pro it took 4 seconds, which is very slow.
I don't think you should use std::vector to do these operations. Image processing (CV aka Computer Vision) algorithms tend to be quite computationally heavy because there is so much data to deal with. OpenCV 2.0 C++ is highly optimized for this kind of operations, e.g. cv::Mat has a header and whenever a cv::Mat is copied with copy assignment or constructor, only the headers are copied with a pointer to the data. They use reference counting to keep track of instances. So memory management is done for you, and that's a good thing.
You could try compiling without debug symbols, i.e. a release build rather than a debug build. You can also try compiling with optimization flags, e.g. -O3 for gcc, which should speed up runtime operations. It might make a difference.
Another thing you could try is to give your process a higher priority: the higher the priority, the less often the process yields the CPU. Again, that might not make a lot of difference; it all depends on the other processes and their priorities.
I hope that helps a bit.
Well, your thinking is wrong.
Why your program is slow:
Your CPU has to loop through a lot of numbers and do calculations on them. The amount of computation is high, and that is why it is slow. Your program's running time is proportional to the size of Mat_A and Mat_B. You can check this by reducing or increasing the size of Mat_A and Mat_B.
Can we accelerate it with std::vector?
Sorry, but no. Using std::vector will not reduce the computational complexity. The arithmetic in OpenCV is already well optimized; rewriting it will most likely lead to slower code.
How to accelerate the calculation: you need to enable the acceleration options for OpenCV.
Intel provides the MKL library to accelerate matrix calculations. You could try it first.
Personally, I find the GPU the easiest approach. But your machine doesn't have a GPU, so it's out of scope here.
You keep iterating over the data over and over again to do independent operations on them.
Something like this iterates only once over the data.
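// assumes Mat_B and img are cv::Mat, and that img is a single row
using px_t = float; // you mentioned float so I'll assume both img and Mat_B use floats
int rows = Mat_B.rows;
cv::Mat output = cv::Mat::zeros(1, rows, Mat_B.type()); // zero-initialised accumulator
auto output_ptr = output.ptr<px_t>(0);
auto img_ptr = img.ptr<px_t>(0);
int min_idx = 0;
int max_idx = 0;
px_t min_ele = std::numeric_limits<px_t>::max();
px_t max_ele = std::numeric_limits<px_t>::lowest(); // min() is the smallest positive float
for (int i = 0; i < rows; ++i)
{
    auto mat_row = Mat_B.ptr<px_t>(i);
    for (int j = 0; j < Mat_B.cols; ++j)
    {
        output_ptr[i] += (img_ptr[j] - mat_row[j]) * (img_ptr[j] - mat_row[j]);
    }
    // The two comparisons below were lost from the post and are reconstructed.
    if (output_ptr[i] < min_ele)
    {
        min_idx = i;
        min_ele = output_ptr[i];
    }
    if (output_ptr[i] > max_ele)
    {
        max_idx = i;
        max_ele = output_ptr[i];
    }
}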
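Since the question asked about std::vector, here is a minimal sketch of the same single-pass idea without OpenCV. It assumes the matrix is stored row-major in one flat vector; the function name nearest_row and its parameters are made up for this example. Like the loop above, it uses the sum of squared differences rather than the plain sum from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Index of the row of `data` (row-major, `cols` values per row) with the
// smallest squared Euclidean distance to `query`. One pass over the data,
// no temporary matrices.
std::size_t nearest_row(const std::vector<float>& data, std::size_t cols,
                        const std::vector<float>& query) {
    assert(cols > 0 && query.size() == cols && data.size() % cols == 0);
    const std::size_t rows = data.size() / cols;
    std::size_t best = 0;
    float best_dist = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < rows; ++i) {
        const float* row = data.data() + i * cols;
        float dist = 0.0f;
        for (std::size_t j = 0; j < cols; ++j) {
            const float diff = query[j] - row[j];
            dist += diff * diff;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best = i;
        }
    }
    return best;
}
```

Whether this beats the OpenCV calls depends on the compiler flags and on memory bandwidth, so it is worth timing both on the real data.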
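While I am also not sure whether it is faster, you can do this, assuming Mat_B contains uchar:
std::vector<uchar> array_B(Mat_B.rows * Mat_B.cols);
// assumed: the right-hand side was lost from the post; this copy requires Mat_B.isContinuous()
array_B.assign(Mat_B.data, Mat_B.data + Mat_B.total());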

Fast, good quality pixel interpolation for extreme image downscaling

In my program, I am downscaling an image of 500px or larger to an extreme level of approx 16px-32px. The source image is user-specified so I do not have control over its size. As you can imagine, few pixel interpolations hold up and inevitably the result is heavily aliased.
I've tried bilinear, bicubic and square average sampling. The square average sampling actually provides the most decent results but the smaller it gets, the larger the sampling radius has to be. As a result, it gets quite slow - slower than the other interpolation methods.
I have also tried an adaptive square average sampling so that the smaller it gets the greater the sampling radius, while the closer it is to its original size, the smaller the sampling radius. However, it produces problems and I am not convinced this is the best approach.
So the question is: What is the recommended type of pixel interpolation that is fast and works well on such extreme levels of downscaling?
I do not wish to use a library so I will need something that I can code by hand and isn't too complex. I am working in C++ with VS 2012.
Here's some example code I've tried as requested (hopefully without errors from my pseudo-code cut and paste). This performs a 7x7 average downscale and although it's a better result than bilinear or bicubic interpolation, it also takes quite a hit:
// Sizing control
ctl(0): "Resize",Range=(0,800),Val=100
// Variables
float fracx,fracy;
int Xnew,Ynew,p,q,Calc;
int x,y,p1,q1,i,j;
//New image dimensions
for (y=0; y<image->height; y++){ // rows
for (x=0; x<image->width; x++){ // columns
for (z=0; z<3; z++){ // channels
for (i=-3;i<=3;i++) {
for (j=-3;j<=3;j++) {
Calc += (int)(src(p1-i,q1-j,z));
} //j
} //i
Calc /= 49;
pset(x, y, z, Calc);
} // channels
} // columns
} // rows
The first point is to use pointers to your data. Never use indexes at every pixel. When you write: src(p1-i,q1-j,z) or pset(x, y, z, Calc) how much computation is being made? Use pointers to data and manipulate those.
Second: your algorithm is wrong. You don't want an average filter, but you want to make a grid on your source image and for every grid cell compute the average and put it in the corresponding pixel of the output image.
The specific solution should be tailored to your data representation, but it could be something like this:
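std::vector<uint32_t> accum(Xnew);
std::vector<uint32_t> count(Xnew);
uint32_t *paccum, *pcount;
uint8_t* pin = /*pointer to input data*/;
uint8_t* pout = /*pointer to output data*/;
for (int dr = 0, sr = 0, w = image->width, h = image->height; sr < h; ++dr) {
    // Clear the per-row accumulators (accum.data()/count.data() are assumed; the operands were lost from the post)
    memset(paccum = accum.data(), 0, Xnew*4);
    memset(pcount = count.data(), 0, Xnew*4);
    // Add up every source row that falls into output row dr
    while (sr < h && sr * Ynew / h == dr) {
        paccum = accum.data();
        pcount = count.data();
        for (int dc = 0, sc = 0; sc < w; ++sc) {
            *paccum += *pin++; // assumed: the post read "*i", taken here to be the next input pixel
            *pcount += 1;
            // Move on once the next source column belongs to the next output cell
            if ((sc + 1) * Xnew / w > dc) {
                ++dc; ++paccum; ++pcount; // assumed: this body was lost from the post
            }
        }
        ++sr; // assumed: lost from the post
    }
    std::transform(begin(accum), end(accum), begin(count), pout, std::divides<uint32_t>());
    pout += Xnew;
}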
This was written using my own library (still in development) and it seems to work, but later I changed the variables names in order to make it simpler here, so I don't guarantee anything!
The idea is to have a local buffer of 32 bit ints which can hold the partial sum of all pixels in the rows which fall in a row of the output image. Then you divide by the cell count and save the output to the final image.
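For reference, here is a self-contained sketch of the same grid-cell averaging idea for an 8-bit grayscale image. It favours clarity over speed (it indexes instead of walking pointers and keeps accumulators for the whole output rather than for one row). The function name downscale_box and the flat row-major layout are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Area-average downscale of a row-major 8-bit grayscale image from w x h to
// nw x nh (with nw <= w and nh <= h). Every source pixel is added to exactly
// one destination cell, then each cell is divided by its pixel count.
std::vector<std::uint8_t> downscale_box(const std::vector<std::uint8_t>& src,
                                        std::size_t w, std::size_t h,
                                        std::size_t nw, std::size_t nh) {
    assert(src.size() == w * h && nw > 0 && nh > 0 && nw <= w && nh <= h);
    std::vector<std::uint32_t> accum(nw * nh, 0);
    std::vector<std::uint32_t> count(nw * nh, 0);
    for (std::size_t sy = 0; sy < h; ++sy) {
        const std::size_t dy = sy * nh / h;      // destination row of this source row
        for (std::size_t sx = 0; sx < w; ++sx) {
            const std::size_t dx = sx * nw / w;  // destination column
            accum[dy * nw + dx] += src[sy * w + sx];
            count[dy * nw + dx] += 1;
        }
    }
    std::vector<std::uint8_t> dst(nw * nh);
    for (std::size_t i = 0; i < dst.size(); ++i)
        dst[i] = static_cast<std::uint8_t>(accum[i] / count[i]);
    return dst;
}
```

Because each source pixel is read once, the cost depends on the source size only, not on how extreme the downscale is. For colour images the same loop would run once per channel.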
The first thing you should do is set up a performance evaluation system to measure how much any change affects performance.
As said previously, you should use pointers rather than indexes for a (probably) substantial
speed-up, and you should not simply average, since a basic averaging of pixels is essentially a blur filter.
I would highly advise you to rework your code to use "kernels". A kernel is the matrix holding the weight given to each pixel used. That way, you will be able to test different strategies and optimize quality.
Example of kernels:
Upsampling/downsampling kernel:
Note: from the code it seems you apply a 3x3 kernel, although you initially described a 7x7 kernel. The equivalent 3x3 kernel as posted would be:
[1 1 1]
[1 1 1] * 1/9
[1 1 1]

FFTW vs Matlab FFT

I posted this on matlab central but didn't get any responses so I figured I'd repost here.
I recently wrote a simple routine in Matlab that uses an FFT in a for-loop; the FFT dominates the calculations. I wrote the same routine in mex just for experimentation purposes and it calls the FFTW 3.3 library. It turns out that the matlab routine runs faster than the mex routine for very large arrays (about twice as fast). The mex routine uses wisdom and performs the same FFT calculations. I also know matlab uses FFTW, but is it possible their version is slightly more optimized? I even used the FFTW_EXHAUSTIVE flag and it's still about twice as slow for large arrays as the MATLAB counterpart. Furthermore, I ensured the matlab I used was single threaded with the "-singleCompThread" flag and the mex file I used was not in debug mode. Just curious if this was the case - or if there are some optimizations matlab is using under the hood that I don't know about. Thanks.
Here's the mex portion:
void class_cg_toeplitz::analysis() {
    // This method computes CG iterations using FFTs
    // Check for wisdom
    if(fftw_import_wisdom_from_filename("cd.wis") == 0) {
        mexPrintf("wisdom not loaded.\n");
    } else {
        mexPrintf("wisdom loaded.\n");
    }
    // Set FFTW Plan - use interleaved FFTW
    fftw_plan plan_forward_d_buffer;
    fftw_plan plan_forward_A_vec;
    fftw_plan plan_backward_Ad_buffer;
    fftw_complex *A_vec_fft;
    fftw_complex *d_buffer_fft;
    A_vec_fft = fftw_alloc_complex(n);
    d_buffer_fft = fftw_alloc_complex(n);
    // CREATE MASTER PLAN - Do this on an empty vector as creating a plan
    // with FFTW_MEASURE will erase the contents;
    // Use d_buffer
    // This is somewhat dangerous because Ad_buffer is a vector; but it does not
    // get resized so &Ad_buffer[0] should work
    plan_forward_d_buffer = fftw_plan_dft_r2c_1d(d_buffer.size(),&d_buffer[0],d_buffer_fft,FFTW_EXHAUSTIVE);
    plan_forward_A_vec = fftw_plan_dft_r2c_1d(A_vec.height,A_vec.value,A_vec_fft,FFTW_WISDOM_ONLY);
    // A_vec_fft.*d_buffer_fft will overwrite d_buffer_fft
    plan_backward_Ad_buffer = fftw_plan_dft_c2r_1d(Ad_buffer.size(),d_buffer_fft,&Ad_buffer[0],FFTW_EXHAUSTIVE);
    // Get A_vec_fft
    fftw_execute(plan_forward_A_vec); // assumed: this call is missing from the post
    // Find initial direction - this is the initial residual
    for (int i=0;i<n;i++) {
        d_buffer[i] = b.value[i];
        r_buffer[i] = b.value[i];
    }
    // Start CG iterations
    norm_ro = norm(r_buffer);
    double fft_reduction = (double)Ad_buffer.size(); // Must divide by size of vector because inverse FFT does not do this
    while (norm(r_buffer)/norm_ro > relativeresidual_cutoff) {
        // Find Ad - use fft
        fftw_execute(plan_forward_d_buffer); // assumed: this call is missing from the post
        // Get A_vec_fft.*fft(d) - A_vec_fft is only real, but d_buffer_fft
        // has complex elements; Overwrite d_buffer_fft
        for (int i=0;i<n;i++) {
            d_buffer_fft[i][0] = d_buffer_fft[i][0]*A_vec_fft[i][0]/fft_reduction;
            d_buffer_fft[i][1] = d_buffer_fft[i][1]*A_vec_fft[i][0]/fft_reduction;
        }
        fftw_execute(plan_backward_Ad_buffer); // assumed: this call is missing from the post
        // Calculate r'*r
        rtr_buffer = 0;
        for (int i=0;i<n;i++) {
            rtr_buffer = rtr_buffer + r_buffer[i]*r_buffer[i];
        }
        // Calculate alpha
        alpha = 0;
        for (int i=0;i<n;i++) {
            alpha = alpha + d_buffer[i]*Ad_buffer[i];
        }
        alpha = rtr_buffer/alpha;
        // Calculate new x
        for (int i=0;i<n;i++) {
            x[i] = x[i] + alpha*d_buffer[i];
        }
        // Calculate new residual
        for (int i=0;i<n;i++) {
            r_buffer[i] = r_buffer[i] - alpha*Ad_buffer[i];
        }
        // Calculate beta
        beta = 0;
        for (int i=0;i<n;i++) {
            beta = beta + r_buffer[i]*r_buffer[i];
        }
        beta = beta/rtr_buffer;
        // Calculate new direction vector
        for (int i=0;i<n;i++) {
            d_buffer[i] = r_buffer[i] + beta*d_buffer[i];
        }
        *total_counter = *total_counter+1;
        if(*total_counter >= iteration_cutoff) {
            // Set total_counter to -1, this indicates failure
            *total_counter = -1;
            break; // assumed: missing from the post
        }
    }
    // Store Wisdom
    fftw_export_wisdom_to_filename("cd.wis"); // assumed: this call is missing from the post
    // Free fft alloc'd memory and plans
    // (assumed cleanup; the calls are missing from the post)
    fftw_destroy_plan(plan_forward_d_buffer);
    fftw_destroy_plan(plan_forward_A_vec);
    fftw_destroy_plan(plan_backward_Ad_buffer);
    fftw_free(A_vec_fft);
    fftw_free(d_buffer_fft);
}
Here's the matlab portion:
% Take FFT of A_vec.
A_vec_fft = fft(A_vec); % Take fft once
% Find initial direction - this is the initial residual
x = zeros(n,1); % solution vector
r = zeros(n,1); % residual
d = zeros(n+(n-2),1); % search direction; pad to allow FFT
for i = 1:n
    d(i) = b(i);
    r(i) = b(i);
end
% Enter CG iterations
total_counter = 0;
rtr_buffer = 0;
alpha = 0;
beta = 0;
Ad_buffer = zeros(n+(n-2),1); % This holds the product of A*d - calculate this once per iteration and using FFT; only 1:n is used
norm_ro = norm(r);
while(norm(r)/norm_ro > 10^-6)
    % Find Ad - use fft
    Ad_buffer = ifft(A_vec_fft.*fft(d));
    % Calculate rtr_buffer
    rtr_buffer = r'*r;
    % Calculate alpha
    alpha = rtr_buffer/(d(1:n)'*Ad_buffer(1:n));
    % Calculate new x
    x = x + alpha*d(1:n);
    % Calculate new residual
    r = r - alpha*Ad_buffer(1:n);
    % Calculate beta
    beta = r'*r/(rtr_buffer);
    % Calculate new direction vector
    d(1:n) = r + beta*d(1:n);
    % Update counter
    total_counter = total_counter+1;
end
In terms of time, for N = 50000 and b = 1:n it takes about 10.5 seconds with mex and 4.4 seconds with matlab. I'm using R2011b. Thanks
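For readers wondering why d is padded to n+(n-2) entries: a symmetric Toeplitz matrix can be embedded in a circulant matrix of size 2n-2, and a circulant product is a circular convolution, which is what the FFT computes. The C++ sketch below checks that identity with a plain O(n^2) circular convolution standing in for the FFT. It assumes A_vec holds the first column of that circulant, which the post does not state, and the function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: y = T*d for the symmetric Toeplitz matrix T(i,j) = a[|i-j|].
std::vector<double> toeplitz_direct(const std::vector<double>& a,
                                    const std::vector<double>& d) {
    assert(a.size() == d.size());
    const std::size_t n = a.size();
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += a[i > j ? i - j : j - i] * d[j];
    return y;
}

// Same product through a circulant embedding of length m = 2n - 2.
std::vector<double> toeplitz_circulant(const std::vector<double>& a,
                                       const std::vector<double>& d) {
    assert(a.size() == d.size() && a.size() >= 2);
    const std::size_t n = a.size(), m = 2 * n - 2;
    // First column of the circulant: a0 ... a(n-1), a(n-2) ... a1.
    std::vector<double> c(m, 0.0);
    for (std::size_t k = 0; k < n; ++k) c[k] = a[k];
    for (std::size_t k = 1; k + 1 < n; ++k) c[m - k] = a[k];
    // Zero-pad d to length m, like the padded d in the MATLAB code.
    std::vector<double> dp(m, 0.0);
    for (std::size_t j = 0; j < n; ++j) dp[j] = d[j];
    // Circular convolution; an FFT computes this as ifft(fft(c).*fft(dp)).
    // Only the first n outputs are the Toeplitz product.
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j)
            y[i] += c[(i + m - j) % m] * dp[j];
    return y;
}
```

This is also why only entries 1:n of Ad_buffer are used in the MATLAB loop: the remaining n-2 outputs of the circular convolution are a by-product of the embedding.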
A few observations rather than a definite answer since I do not know any of the specifics of the MATLAB FFT implementation:
Based on the code you have, I can see two explanations for the speed difference:
the speed difference is explained by differences in levels of optimization of the FFT
the while loop in MATLAB is executed a significantly smaller number of times
I will assume you already looked into the second issue and that the numbers of iterations are comparable. (If they aren't, this is most likely due to some accuracy issue and worth further investigation.)
Now, regarding FFT speed comparison:
Yes, in theory FFTW is faster than other high-level FFT implementations, but that only holds as long as you compare apples to apples. Here you are comparing implementations one level further down, at the assembly level, where not only the choice of algorithm but also its actual optimization for a specific processor, by software developers of varying skill, comes into play.
I have optimized or reviewed optimized FFTs in assembly on many processors over the years (I was in the benchmarking industry), and great algorithms are only part of the story. There are considerations that are very specific to the architecture you are coding for (accounting for latencies, scheduling of instructions, optimization of register usage, arrangement of data in memory, accounting for branch taken/not taken latencies, etc.) and that make differences as important as the selection of the algorithm.
With N=50000, we are also talking about large memory buffers: yet another door to optimizations that can quickly get pretty specific to the platform you run your code on. How well you manage to avoid cache misses won't be dictated by the algorithm so much as by how the data flows and what optimizations a software developer may have used to bring data in and out of memory efficiently.
Though I do not know the details of the MATLAB FFT implementation, I am pretty sure that an army of DSP engineers has been (and is still) honing on its optimization as it is key to so many designs. This could very well mean that MATLAB had the right combination of developers to produce a much faster FFT.
This is classic performance gain thanks to low-level and architecture-specific optimization.
Matlab uses FFT from the Intel MKL (Math Kernel Library) binary (mkl.dll). These are routines optimized (at assembly level) by Intel for Intel processors. Even on AMD's it seems to give nice performance boosts.
FFTW seems like a normal c library that is not as optimized. Hence the performance gain to use the MKL.
I have found the following comment on the MathWorks website [1]:
Note on large powers of 2: For FFT dimensions that are powers of
2, between 2^14 and 2^22, MATLAB software uses special preloaded
information in its internal database to optimize the FFT computation.
No tuning is performed when the dimension of the FFT is a power of 2,
unless you clear the database using the command fftw('wisdom', []).
Although it relates to powers of 2, it may hint that MATLAB employs its own 'special wisdom' when using FFTW for certain (large) array sizes. Consider: 2^16 = 65536.
[1] R2013b Documentation available from (accessed on 29 Oct 2013)
EDIT: @wakjah's reply to this answer is accurate: FFTW does support split real and imaginary memory storage via its Guru interface. My claim about hacking is thus not accurate, but it can very well apply if FFTW's Guru interface is not used - which is the case by default, so beware still!
First, sorry for being a year late. I'm not convinced that the speed increase you see comes from MKL or other optimizations. There is something quite fundamentally different between FFTW and Matlab, and that is how complex data is stored in memory.
In Matlab, the real and imaginary parts of a complex vector X are separate arrays Xre[i] and Xim[i] (linear in memory, efficient when operating on either of them separately).
In FFTW, the real and imaginary parts are interlaced as double[2] by default, i.e. X[i][0] is the real part, and X[i][1] is the imaginary part.
Thus, to use the FFTW library in mex files one cannot use the Matlab array directly, but must allocate new memory first, then pack the input from Matlab into FFTW format, and then unpack the output from FFTW into Matlab format. i.e.
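X = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
Y = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
// Pack the split Matlab arrays into FFTW's interleaved format
for (size_t i=0; i<N; ++i) {
    X[i][0] = Xre[i];
    X[i][1] = Xim[i];
}
// ... plan and execute the FFT from X into Y here ...
// Unpack FFTW's interleaved output into the split Matlab arrays
for (size_t i=0; i<N; ++i) {
    Yre[i] = Y[i][0];
    Yim[i] = Y[i][1];
}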
Hence, this requires 2x memory allocations + 4x reads + 4x writes -- all of size N. This does take a toll speed-wise on large problems.
I have a hunch that Mathworks may have hacked the FFTW3 code to allow it to read input vectors directly in the Matlab format, which avoids all of the above.
In this scenario, one can allocate only X and use X for Y to run FFTW in-place (as fftw_plan_*(N, X, X, ...) instead of fftw_plan_*(N, X, Y, ...)), since the result will be copied to the Yre and Yim Matlab vectors anyway, unless the application requires or benefits from keeping X and Y separate.
EDIT: Looking at the memory consumption in real-time when running Matlab's fft2() and my code based on the fftw3 library, it shows that Matlab allocates only one additional complex array (the output), whereas my code needs two such arrays (the fftw_complex* buffer plus the Matlab output). An in-place conversion between the Matlab and fftw formats is not possible because Matlab's real and imaginary arrays are not consecutive in memory. This suggests that Mathworks hacked the fftw3 library to read/write the data using the Matlab format.
One other optimization for multiple calls, is to allocate persistently (using mexMakeMemoryPersistent()). I'm not sure if the Matlab implementation does this as well.
p.s. As a side note, the Matlab complex data storage format is more efficient for operating on the real or imaginary vectors separately. With FFTW's format you'd have to read memory with a stride of 2.

cv::Mat CV_8U product error and slow CV_32F product

I am trying to compute the product of a 2772x128 matrix and a 4000x128 matrix. Both are matrices of SIFT descriptors. I am using the following code:
Mat a = Mat(nframes, descrSize, CV_8U, DATAdescr);
Mat b = Mat(vocabulary_size, descrSize, CV_8U, vocabulary);
Mat ab =a * b.t();
The problem is that when calculating the product, it throws an error saying
err_msg = 0x00cdd5e0 "..\..\..\src\opencv\modules\core\src\matmul.cpp:711: error: (-215) type == B.type() && (type == CV_32FC1 || type == CV_64FC1 || type == CV_32FC2 || type == CV_64FC2)"
The solution to this has been to convert the data type to CV_32FC1
Mat a = Mat(nframes, descrSize, CV_8U, DATAdescr);
Mat b = Mat(vocabulary_size, descrSize, CV_8U, vocabulary);
a.convertTo(a, CV_32FC1);
b.convertTo(b, CV_32FC1);
Mat ab = a * b.t();
It works well, but it is consuming too much time, about 1.2 s. I would like to try the same product but using integers, to see if I can speed this up. Am I doing something wrong? I can't see any reason I cannot do matrix product between CV_8U matrices.
EDIT: The answers so far suggest using other libraries or solving the problem another way. I was thinking of opening a new thread to ask for advice on that, but can anybody answer my original question, please? Can I really not multiply CV_8U or CV_32S matrices?
In your other message you said that the following code would take 0.9 seconds.
MatrixXd A = MatrixXd::Random(1000, 1000);
MatrixXd B = MatrixXd::Random(1000, 500);
MatrixXd X;
I tried a little benchmark on my machine, intel core i7 running on linux. My full benchmark code is the following:
#include <Eigen/Dense>
using namespace Eigen;
int main(int argc, char *argv[])
{
    MatrixXd A = MatrixXd::Random(2772, 128);
    MatrixXd B = MatrixXd::Random(4000, 128);
    MatrixXd X = A*B.transpose();
    return 0;
}
I just use the time command from linux so the running time includes the launching and stopping of the executable.
1/ Compiling with no optimisation (gcc compiler):
g++ -I/usr/include/eigen3 matcal.cpp -O0 -o matcal
time ./matcal
real 0m13.177s -> this is the time you should be looking at
user 0m13.133s
sys 0m0.022s
13 seconds, that's very slow. By the way, without the matrix multiplication it takes 0.048s, and this is with bigger matrices than in your 0.9s example. Why?
Using compiler optimisation with Eigen is very important.
2/ Compiling with some optimisation:
g++ -I/usr/include/eigen3 matcal.cpp -O2 -o matcal
time ./matcal
real 0m0.324s
user 0m0.298s
sys 0m0.024s
Now 0.324s, that's better!
3/ Switching all the optimization flags (at least all that I know of, I'm not an expert in this field)
g++ -I/usr/include/eigen3 matcal.cpp -O3 -march=corei7 -mtune=corei7 -o matcal
time ./matcal
real 0m0.317s
user 0m0.291s
sys 0m0.024s
0.317s: close, but a few ms gained (consistently over a few tests). So in my opinion you do have a problem with your usage of Eigen: either you don't switch on compiler optimization or your compiler does not do it by itself.
I'm not an expert in Eigen, I have only used it a few times, but I think the documentation is quite good and you should probably read it to get the most out of it.
Concerning the performance comparison with MatLab: last time I read about Eigen it was not multithreaded, while MatLab probably uses multithreaded libraries. For matrix multiplication you could split your matrix into several chunks and parallelize the multiplication of each chunk using TBB.
As suggested by remi, I implemented the same matrix multiplication using Eigen. Here it is:
const int descrSize = 128;
MatrixXi a(nframes, descrSize);
MatrixXi b(vocabulary_size, descrSize);
MatrixXi ab(nframes, vocabulary_size);
unsigned char* dataPtr = DATAdescr;
for (int i=0; i<nframes; ++i)
    for (int j=0; j<descrSize; ++j)
        a(i,j)=(int)*dataPtr++; // assumed: this line was lost from the post; it mirrors the loop for b below
unsigned char* vocPtr = vocabulary;
for (int i=0; i<vocabulary_size; ++i)
    for (int j=0; j<descrSize; ++j)
        b(i,j)=(int)*vocPtr ++;
ab = a*b.transpose();
MatrixXi aa = a.rowwise().sum();
MatrixXi bb = b.rowwise().sum();
MatrixXi d = (aa.replicate(1,vocabulary_size) + bb.transpose().replicate(nframes,1) - 2*ab).cwiseAbs2();
The key line is the line that says
ab = a*b.transpose();
vocabulary and DATAdescr are arrays of unsigned char. DATAdescr is 2782x128 and vocabulary is 4000x128. I saw that I could use Map, but my first attempt to use it failed. The initial assignment loops cost about 0.001 s, so they are not the bottleneck. The whole process takes about 1.23 s.
The same implementation in matlab (0.05s.) is:
aa=sum(a.*a,2); bb=sum(b.*b,2); ab=a*b';
d = sqrt(abs(repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1]) - 2*ab));
Thanks remi in advance for your help.
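The MATLAB lines above rely on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, where aa and bb are sums of squares (sum(a.*a,2)), not plain row sums. A minimal standard-library C++ sketch that checks the expansion against the direct definition is given below; the function names are made up for this example and plain int is used to keep the arithmetic exact.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using IntMatrix = std::vector<std::vector<int>>;

// Squared distances between every row of a and every row of b, computed
// directly from the definition.
IntMatrix dist2_direct(const IntMatrix& a, const IntMatrix& b) {
    IntMatrix d(a.size(), std::vector<int>(b.size(), 0));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            for (std::size_t k = 0; k < a[i].size(); ++k) {
                const int diff = a[i][k] - b[j][k];
                d[i][j] += diff * diff;
            }
    return d;
}

// Same result via aa + bb - 2*ab, where the ab term is the matrix product
// a * b^T that dominates the running time.
IntMatrix dist2_expanded(const IntMatrix& a, const IntMatrix& b) {
    std::vector<int> aa(a.size(), 0), bb(b.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (int v : a[i]) aa[i] += v * v;   // squared norm of row i of a
    for (std::size_t j = 0; j < b.size(); ++j)
        for (int v : b[j]) bb[j] += v * v;   // squared norm of row j of b
    IntMatrix d(a.size(), std::vector<int>(b.size(), 0));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j) {
            int ab = 0;
            for (std::size_t k = 0; k < a[i].size(); ++k) ab += a[i][k] * b[j][k];
            d[i][j] = aa[i] + bb[j] - 2 * ab;
        }
    return d;
}
```

The Euclidean distance itself is the square root of each entry, as in the MATLAB line with sqrt(abs(...)).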
When you multiply matrices you multiply element values and sum them. If the elements only have a range of 0-255, it's quite likely that the result is going to be more than 255, so a product of CV_8U matrices isn't very useful.
If you know that your result will fit in a byte, you can just do the multiplication yourself by looping over the elements.
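To put numbers on the overflow argument, here is a small standard-library C++ sketch of one entry of such a product for 128-dimensional 8-bit descriptors, accumulated in 32 bits. The helper names dot_u8 and saturate_u8 are assumptions for this example; they are not OpenCV functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Dot product of two 8-bit vectors, accumulated in a 32-bit integer.
// For 128 elements the worst case is 128 * 255 * 255, which fits easily.
std::int32_t dot_u8(const std::vector<std::uint8_t>& a,
                    const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size());
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    return sum;
}

// What an 8-bit result would have to do with that value: clamp it to 0..255.
std::uint8_t saturate_u8(std::int32_t v) {
    if (v < 0) return 0;
    if (v > 255) return 255;
    return static_cast<std::uint8_t>(v);
}
```

With every element at 255 the true dot product is 8323200, while an 8-bit result could only hold 255, so the inputs can stay 8-bit but the accumulator and the output need a wider type.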
edit: I'm a little surprised that the float version is so much slower; generally OpenCV is pretty good performance-wise, with multi-core and optimised SSE2 instructions. Did you build from source? Do you have TBB (i.e. multithreading) and an SSE2 CPU?
Try compiling OpenCV using Eigen as the back end. There is an option for this in the CMakeList. I read in your comment that you use OpenCV just to speed up the matrix multiplication, so you might even want to try Eigen directly.
One last solution: use the GPU module of OpenCV.

What algorithm does OpenCV's Bayer conversion use?

I would like to implement a GPU Bayer to RGB image conversion algorithm, and I was wondering what algorithm the OpenCV cvtColor function uses. Looking at the source I see what appears to be a variable number of gradients algorithm and a basic algorithm that could maybe be bilinear interpolation? Does anyone have experience with this that they could share with me, or perhaps know of GPU code to convert from Bayer to BGR format?
The source code is in imgproc/src/color.cpp. I'm looking for a link to it. Bayer2RGB_ and Bayer2RGB_VNG_8u are the functions I'm looking at.
Edit: Here's a link to the source.
I've already implemented a bilinear interpolation algorithm, but it doesn't seem to work very well for my purposes. The picture looks ok, but I want to compute HOG features from it and in that respect it doesn't seem like a good fit.
Default is 4way linear interpolation or variable number of gradients if you specify the VNG version.
see ..\modules\imgproc\src\color.cpp for details.
I submitted a simple linear CUDA Bayer->RGB(A) to opencv, haven't followed if it's been accepted but it should be in the bugs tracker.
It's based on the code in Cuda Bayer/CFA demosaicing example.
Here is a sample of how to use cv::gpu in your own code.
/*-------RG ccd BGRA output ----------------------------*/
__global__ void bayerRG(const cv::gpu::DevMem2Db in, cv::gpu::PtrStepb out)
{
    // Note called for every pair, so x/y are for start of cell so need x+1,Y+1 for right/bottom pair
    // R G
    // G B
    // src
    int x = 2 * ((blockIdx.x*blockDim.x) + threadIdx.x);
    int y = 2 * ((blockIdx.y*blockDim.y) + threadIdx.y);
    uchar r,g,b;
    // 'R'
    r = (in.ptr(y)[x]);
    g = (in.ptr(y)[x-1]+in.ptr(y)[x+1]+(in.ptr(y-1)[x]+in.ptr(y+1)[x]))/4;
    b = (in.ptr(y-1)[x-1]+in.ptr(y-1)[x+1]+(in.ptr(y+1)[x-1]+in.ptr(y+1)[x+1]))/4;
    ((uchar4*)out.ptr(y))[x] = make_uchar4( b,g,r,0xff);
    // 'G' in R
    r = (in.ptr(y)[x]+in.ptr(y)[x+2])/2;
    g = (in.ptr(y)[x+1]);
    b = (in.ptr(y-1)[x+1]+in.ptr(y+1)[x+1])/2;
    ((uchar4*)out.ptr(y))[x+1] = make_uchar4( b,g,r,0xff);
    // 'G' in B
    r = (in.ptr(y)[x]+in.ptr(y+2)[x])/2;
    g = (in.ptr(y+1)[x]);
    // blue neighbours are at x-1 and x+1 on this row (the post had x+2, which is a green site)
    b = (in.ptr(y+1)[x-1]+in.ptr(y+1)[x+1])/2;
    ((uchar4*)out.ptr(y+1))[x] = make_uchar4( b,g,r,0xff);
    // 'B'
    r = (in.ptr(y)[x]+in.ptr(y)[x+2]+in.ptr(y+2)[x]+in.ptr(y+2)[x+2])/4;
    g = (in.ptr(y+1)[x]+in.ptr(y+1)[x+2]+in.ptr(y)[x+1]+in.ptr(y+2)[x+1])/4;
    b = (in.ptr(y+1)[x+1]);
    ((uchar4*)out.ptr(y+1))[x+1] = make_uchar4( b,g,r,0xff);
}
/* called from */
extern "C" void cuda_bayer(const cv::gpu::DevMem2Db& img, cv::gpu::PtrStepb out)
{
    dim3 threads(16,16);
    dim3 grid((img.cols/2)/(threads.x), (img.rows/2)/(threads.y));
    bayerRG<<<grid, threads>>>(img, out); // assumed: the launch line was lost from the post
}
Currently, to my knowledge, the best debayer out there is DFPD (directional filtering with posteriori decision) as explained in this paper. The paper is quite explanatory and you can easily prototype this approach on Matlab. Here's a blog post comparing the results of DFPD to debayer based on linear approach. You can visibly see the improvement in artifacts, colors and sharpness.
As far as I know at this point it is using adaptive homogeneity directed demosaicing. Explained in a paper by Hirakawa and many other sources on the web.