How to fix slow kmeans() in OpenCV - C++

I use kmeans() in a project about bag of words and it takes a lot of time: with 600 images it runs for 40-50 minutes. I looked at the source code, and this part takes most of the time:
for( i = 0; i < N; i++ ) ///very very slow part because N*K is huge
{
    sample = data.ptr<float>(i);
    int k_best = 0;
    double min_dist = DBL_MAX;
    for( k = 0; k < K; k++ )
    {
        const float* center = centers.ptr<float>(k);
        double dist = normL2Sqr_(sample, center, dims);
        if( min_dist > dist )
        {
            min_dist = dist;
            k_best = k;
        }
    }
    compactness += min_dist;
    labels[i] = k_best;
}
I tried but I can't manage to make that part faster. Is there a way to make it more efficient? The loop takes 22-23 seconds per pass, so the whole program takes 40-50 minutes to finish, and as a result I can't try other video sets or image sets. A better k-means implementation in C++ would help too, and so would a way to reduce N (the number of features); but K is the dictionary size, so I can't reduce it. Thanks in advance.

The k-means implementation in OpenCV is very inefficient, and there are a number of tricks to improve performance that it does not use. It would be considerable work to rewrite it yourself.
The implementation in VLfeat offers better algorithms for k-means, but I don't know about the quality of the implementation.
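For example, one trick the loop above does not use is batching the N*K distance computations into a single matrix product: since argmin_k ||x - c_k||^2 = argmin_k (||c_k||^2 - 2 x.c_k), the label assignment can be driven by one optimized GEMM instead of N*K scalar distance calls. A minimal sketch, assuming CV_32F row-major data and centers as in the loop above (illustrative only, not the OpenCV implementation; function name is made up):
#include <opencv2/core/core.hpp>
#include <vector>

// Assign each sample to its nearest center using one matrix product.
static void assignLabels(const cv::Mat& data,     // N x dims, CV_32F
                         const cv::Mat& centers,  // K x dims, CV_32F
                         cv::Mat& labels)         // N x 1, CV_32S
{
    // dots(i,k) = data.row(i) . centers.row(k); a single optimized GEMM
    cv::Mat dots = data * centers.t();            // N x K

    // squared norms ||c_k||^2 of the centers
    std::vector<float> cNorm(centers.rows);
    for( int k = 0; k < centers.rows; k++ )
        cNorm[k] = (float)centers.row(k).dot(centers.row(k));

    labels.create(data.rows, 1, CV_32S);
    for( int i = 0; i < data.rows; i++ )
    {
        const float* d = dots.ptr<float>(i);
        int k_best = 0;
        float best = cNorm[0] - 2.f*d[0];
        for( int k = 1; k < centers.rows; k++ )
        {
            float v = cNorm[k] - 2.f*d[k];
            if( v < best ) { best = v; k_best = k; }
        }
        // note: ||x_i||^2 is omitted since it does not change the argmin;
        // add it back per sample if you need the compactness value
        labels.at<int>(i) = k_best;
    }
}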

Related

Faster mathematical operations over a vector using libsimdpp

Searching around for how I can improve my waveform generation code, I've come across SIMD and the libsimdpp library, but I have no idea how to use it. If I got it right, using raw SIMD would require me to write code for each architecture, while libsimdpp will handle that for me.
What I need to do is calculate the squared sum and RMS value of a chunk of samples. I managed to speed the process up using vectorization, which worked perfectly until I introduced the same calculation for both the left and right channels of an audio file.
So, my question, and what I need help with, is: how can I use libsimdpp (or any library that will make SIMD easier for me) to improve the code below?
// START: vector containing all the audio samples
std::vector<double> samples;
int nb_samples = samples.size();
// END
// START: loop through the samples vector, incrementing the index by samples_per_pixel each time
for (int i = 0; i < nb_samples; i += samples_per_pixel)
{
    // START: create a chunk of samples with the size of samples_per_pixel
    double* chunk = &samples[i];
    // END
    // START: calculate rms and squared sum
    float sum = 0;
    float squaredsum = 0;
    /// there are multiple definitions of the above for both channels, but I won't include them
    //// to make the code easier to read
    for (int j = 0; j < samples_per_pixel; j++)
    {
        if (chunk[j] < 0)
            sum += -chunk[j];
        else
            sum += chunk[j];
        squaredsum += chunk[j] * chunk[j];
    }
    /// average
    float average_point = (sum * 2) / samples_per_pixel;
    // rms
    float mean = squaredsum / samples_per_pixel;
    float rms_point = qSqrt(mean);
    /// drawing of both the average point and rms
    //// [...]
    // END
}
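A rough sketch of how the inner loop could look with libsimdpp, assuming the library is built for an SSE2-or-better target (e.g. with SIMDPP_ARCH_X86_SSE2 defined); the function and variable names here are made up for illustration. It processes the chunk in packs of two doubles, with a scalar tail for chunk sizes that are not a multiple of the vector width:
#include <simdpp/simd.h>

// Accumulate sum(|x|) and sum(x*x) over one chunk using libsimdpp.
void chunk_stats(const double* chunk, int n, double& sum, double& squaredsum)
{
    namespace sp = simdpp;
    sp::float64<2> vsum = sp::splat(0.0);
    sp::float64<2> vsq  = sp::splat(0.0);
    int i = 0;
    for (; i + 2 <= n; i += 2)
    {
        sp::float64<2> v = sp::load_u(chunk + i); // unaligned load of 2 doubles
        vsum = vsum + sp::abs(v);                 // accumulates |x|
        vsq  = vsq + v * v;                       // accumulates x*x
    }
    sum = sp::reduce_add(vsum);                   // horizontal sums
    squaredsum = sp::reduce_add(vsq);
    for (; i < n; ++i)                            // scalar tail
    {
        sum += chunk[i] < 0 ? -chunk[i] : chunk[i];
        squaredsum += chunk[i] * chunk[i];
    }
}
For the two-channel case you can keep two independent pairs of accumulators (one per channel) inside the same loop, which keeps the SIMD units busy without the channels interfering with each other.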

Time step independence of Molecular Dynamics code

I am writing a basic MD code in C++ using LJ potential for an NVE system. The starting configuration is FCC and the starting velocities are randomly generated.
I am facing a strange problem: the evolution of the system seems to be independent of the time step I use. My understanding is that the energy losses should be smaller for small time steps and larger for large ones. However, I get the same result at the end of the simulation in terms of energy whether I run (0.0001 step)*(10000 steps) or 0.001*1000, and so on.
The entire code is too big for me to post here, so I am posting what I think is relevant and leaving out binning etc.; kindly let me know if any additional information is required. I have been through countless codes available online, and though they look similar to mine, I just am not able to figure out what the difference is and where I am going wrong.
The main cpp contains the following loop
for (int i=0; i<t; i++)
{
    potential_calc(neighlist, fromfile, run_parameters, i); //calculating the force fields
    velverlet(neighlist, fromfile, run_parameters, bin, dt); //calculating the velocities
}
The declarations of the 2 cpp files for potential calculation & verlet integration are
void potential_calc(neighborlist_type *neighlist, config_type *fromfile, potential *run_parameters, int t)
void velverlet(neighborlist_type *neighlist, config_type *fromfile, potential *run_parameters, bin_type *bin, double dt)
The code for calculating the force - potential_calc.cpp is below
for (long i=0; i<fromfile->N; i++)
{
    long atom_p = i;
    for (long j=0; j<neighlist[i].countsn; j++)
    {
        long atom_s = neighlist[i].numb[j];
        for (int k=0; k<Dim; k++)
        {
            dist[k] = fromfile->r[atom_p][k] - (fromfile->r[atom_s][k] + neighlist[atom_p].xyz[j][k]*fromfile->L[k]);
            //the .xyz indicates whether the image being considered is real or a mirror (and if a mirror, in which direction)
        }
        disp2 = pow(dist[0],2) + pow(dist[1],2) + pow(dist[2],2);
        if (disp2 < rb2)
        {
            int c1 = fromfile->c[atom_p];
            int c2 = fromfile->c[atom_s];
            long double force_temp;
            disp = pow(disp2, 0.5);
            sig_r6 = pow(run_parameters->sigma[c1-1][c2-1]/disp, 6); //(sigma/r)^6
            sig_r8 = pow(run_parameters->sigma[c1-1][c2-1]/disp, 8); //(sigma/r)^8
            run_parameters->pe[atom_p] += (4*run_parameters->epsilon[c1-1][c2-1]*((sig_r6*sig_r6)-sig_r6)) - potential_correction[c1-1][c2-1];
            force_temp = -1*((48*run_parameters->epsilon[c1-1][c2-1])/pow(run_parameters->sigma[c1-1][c2-1],2)*((sig_r6*sig_r8)-((sig_r8)*0.5)));
            for (int k=0; k<Dim; k++)
            {
                run_parameters->force[atom_p][k] += force_temp*(-1*dist[k]);
            }
        }
    }
    //calculating kinetic energy
    run_parameters->ke[atom_p] = 0.5*(pow(fromfile->v[atom_p][0],2) + pow(fromfile->v[atom_p][1],2) + pow(fromfile->v[atom_p][2],2));
}
Once the force calculation is done, the code moves on to the update of velocity and position in velverlet.cpp:
for (long i=0; i<fromfile->N; i++)
{
    for (int j=0; j<Dim; j++)
    {
        fromfile->v[i][j] += dt*run_parameters->force[i][j];
    }
}
for (long i=0; i<fromfile->N; i++)
{
    for (int j=0; j<Dim; j++)
    {
        fromfile->r[i][j] += dt*fromfile->v[i][j];
    }
}
There may be slight differences in how velocity Verlet is implemented by different people, but I can't figure out how I am getting time-step-independent results.
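For reference, my understanding of a textbook velocity Verlet step is the following (schematic, assuming unit mass in reduced units, and reusing the array layout above):
// half kick: v(t + dt/2) = v(t) + (dt/2)*F(t)
for (long i=0; i<fromfile->N; i++)
    for (int j=0; j<Dim; j++)
        fromfile->v[i][j] += 0.5*dt*run_parameters->force[i][j];

// drift: r(t + dt) = r(t) + dt*v(t + dt/2)
for (long i=0; i<fromfile->N; i++)
    for (int j=0; j<Dim; j++)
        fromfile->r[i][j] += dt*fromfile->v[i][j];

// ... recompute the forces at the new positions here ...

// half kick: v(t + dt) = v(t + dt/2) + (dt/2)*F(t + dt)
for (long i=0; i<fromfile->N; i++)
    for (int j=0; j<Dim; j++)
        fromfile->v[i][j] += 0.5*dt*run_parameters->force[i][j];
The key point is that the force is recomputed between the two half-step velocity updates, so each step uses both F(t) and F(t + dt).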
Please help; any input is appreciated.
Sorry if any formatting/tagging is wrong - this is the first time I am posting here.

How to use cv::parallel_for_ for execution time reduction

I created an image processing algorithm using OpenCV, and currently I'm trying to improve the time efficiency of my own simple function, which is similar to a LUT but with interpolation between values (double calibRI::corr(double)).
I optimized the pixel loop according to the OpenCV docs.
The non-parallel function (calib(cv::Mat) - an object of the calibRI functor class) takes about 0.15s. I decided to use cv::parallel_for_ to make it shorter.
First I implemented it as image tiling, according to this document. The time was reduced to 0.12s (4 threads).
virtual void operator()(const cv::Range& range) const
{
    for(int i = range.start; i < range.end; i++)
    {
        // divide image into 'thr' parts and process them simultaneously
        cv::Rect roi(0, (img.rows/thr)*i, img.cols, img.rows/thr);
        cv::Mat in = img(roi);
        cv::Mat out = retVal(roi);
        out = calib(in); //loops over all pixels and does out[u,v] = calibRI::corr(in[u,v])
    }
}
I thought that running my function in parallel over subimages/tiles/ROIs was not yet optimal, so I implemented it as below:
template <typename T>
class ParallelPixelLoop : public cv::ParallelLoopBody
{
    typedef boost::function<T(T)> pixelProcessingFuntionPtr;
private:
    cv::Mat& image; //source and result image (to be overwritten)
    bool cont;      //if the image is continuous
    size_t rows;
    size_t cols;
    size_t threads;
    std::vector<cv::Range> ranges;
    pixelProcessingFuntionPtr pixelProcessingFunction; //pixel modification function
public:
    ParallelPixelLoop(cv::Mat& img, pixelProcessingFuntionPtr fun, size_t thr = 4)
        : image(img), cont(image.isContinuous()), rows(img.rows), cols(img.cols), threads(thr), pixelProcessingFunction(fun)
    {
        int groupSize = 1;
        if (cont) {
            cols *= rows;
            rows = 1;
            groupSize = (int)std::ceil( cols / (double)threads ); //cast so the division doesn't truncate before ceil
        }
        else {
            groupSize = (int)std::ceil( rows / (double)threads );
        }
        int t = 0;
        for(t = 0; t < threads-1; ++t) {
            ranges.push_back( cv::Range( t*groupSize, (t+1)*groupSize ) );
        }
        ranges.push_back( cv::Range( t*groupSize, rows<=1 ? cols : rows ) ); //last range must reach the end of the image (ceil used before)
    }

    virtual void operator()(const cv::Range& range) const
    {
        for(int r = range.start; r < range.end; r++)
        {
            T* Ip = nullptr;
            cv::Range ran = ranges.at(r);
            if(cont) {
                Ip = image.ptr<T>(0);
                for (int j = ran.start; j < ran.end; ++j)
                {
                    Ip[j] = pixelProcessingFunction(Ip[j]);
                }
            }
            else {
                for(int i = ran.start; i < ran.end; ++i)
                {
                    Ip = image.ptr<T>(i);
                    for (int j = 0; j < cols; ++j)
                    {
                        Ip[j] = pixelProcessingFunction(Ip[j]);
                    }
                }
            }
        }
    }
};
Then I run it on a 1280x1024 CV_64FC1 image on an i5 processor under Windows 8, and get times in the range of 0.4s using the code below:
double t = cv::getTickCount();
ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
cv::parallel_for_(cv::Range(0,4),loop);
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
I have no idea why my implementation is so much slower than iterating over all the pixels in subimages... Is there a bug in my code, or are OpenCV ROIs optimized in some special way?
I do not think there is a time measurement error issue, as described here. I'm using OpenCV time functions.
Is there any other way to reduce the time of this function?
Thanks in advance!
Generally it's really hard to say why using cv::parallel_for_ failed to speed up the whole process. One possibility is that the problem is not related to processing/multithreading but to time measurement. About 2 months ago I tried to optimize this algorithm and I noticed a strange thing: the first time I used it, it took x ms, but if I used it a second, third, ... time (without restarting the application, of course) it took about x/2 (or even x/3) ms. I'm not sure what causes this behaviour - most likely (in my opinion) it's caused by branch prediction: when the code is executed the first time, the branch predictor "learns" which paths are usually taken, so next time it can predict which branches to take (and usually the guess will be correct). You can read more about it here - it's a really good question and it can open your eyes to some quite important things.
So, in your situation I would try a few things:
Measure it many times - 100 or 1000 runs should be enough (if it takes 0.12-0.4s, it won't take long) - and see whether the last version of your code is still the slowest one. Just replace your code with this:
double t = cv::getTickCount();
for (unsigned int i=0; i<1000; i++) {
    ParallelPixelLoop<double> loop(V, boost::bind(&calibRI::corr, this, _1), 4);
    cv::parallel_for_(cv::Range(0,4), loop);
}
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
Test it on a bigger image. Maybe in your situation you just "don't need" 4 cores, but on a bigger image 4 cores may make a positive difference.
Use a profiler (for example Very Sleepy) to see which part of your code is critical.

Fastest way to calculate distance between all rows in a dense eigen::matrix

I am trying to calculate the Euclidean distance between every pair of rows in a 1000x1000 matrix using Eigen. What I have so far is something along these lines:
for (int i = 0; i < matrix.rows(); ++i){
    VectorXd refRow = matrix.row(i);
    for (int j = i+1; j < matrix.rows(); ++j){
        VectorXd eleRow = matrix.row(j);
        euclid_distance = (refRow - eleRow).lpNorm<2>();
        ...
    }
}
My code includes other logic where the "..." is, but I have removed it for performance testing.
Now, I don't expect this to run at the speed of light, but it is taking a lot longer than I expected. Am I doing something wrong in how I use C++ / the Eigen library that might be slowing this down?
Is there any other preferred method?
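One commonly used alternative is to batch the whole computation: expanding ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j turns the double loop into a single matrix product, which Eigen dispatches to an optimized GEMM. A minimal sketch, assuming a MatrixXd input (the function name is made up; note it computes the full symmetric distance matrix, trading memory for speed):
#include <Eigen/Dense>

// All pairwise Euclidean distances between the rows of X in one shot.
Eigen::MatrixXd pairwiseDistances(const Eigen::MatrixXd& X)
{
    Eigen::VectorXd sq = X.rowwise().squaredNorm(); // ||x_i||^2 per row
    Eigen::MatrixXd D2 = -2.0 * X * X.transpose();  // -2 x_i.x_j, one GEMM
    D2.colwise() += sq;                             // + ||x_i||^2
    D2.rowwise() += sq.transpose();                 // + ||x_j||^2
    // clamp tiny negatives caused by rounding before the square root
    return D2.cwiseMax(0.0).cwiseSqrt();
}
Also note that copying each row into a VectorXd inside the loop, as in the snippet above, allocates on every iteration; if you keep the loop, operating on matrix.row(i) expressions directly avoids those temporaries.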

Efficient 2D FFT of fixed length real input data in C/C++

I'm developing an algorithm that makes several calls to an FFT function. I have time constraints (real-time is desired), so I need to minimize the time spent in every FFT call.
I'm working with the OpenCV library and I have already implemented my code with two different approaches:
Using FFTW library. Data/memory management + FFT(8ms) = 14ms (in mean, FFT_MEASURE flag).
Using OpenCV fft function. Data/memory management + FFT (21ms) = 23ms (in mean).
As my input data is always a fixed, real image of 512x512 pixels, do you think that if I implemented the FFT algorithm myself based on the mathematical definition of the DFT, storing sine/cosine tables, I could achieve better performance, or is the FFTW library really that well optimized? Any better ideas?
All ideas and suggestions will be really appreciated. For now, I am not considering parallelization or GPU implementations.
Thank you
Update:
System: Intel Xeon 5130 2.0GHz CPU in Windows 7, Visual Studio 10.0 and FFTW 3.3.3 (compiled following instructions in the site), OpenCV 2.4.3.
Code example for an FFT call with FFTW (input: OpenCV Mat CV_32F (1 channel, float type); output: OpenCV Mat CV_32FC2 (2 channels, float type)):
float *im_data;
fftwf_complex *data_in;
fftwf_complex *fft;
fftwf_plan plan_f;
int i, j, k;
int height = I.rows;
int width = I.cols;
int N = height*width;
float* outdata = new float[2*N];
im_data = (float*)I.data;
data_in = (fftwf_complex*)fftwf_malloc(sizeof(fftwf_complex) * N);
fft = (fftwf_complex*)fftwf_malloc(sizeof(fftwf_complex) * N);
plan_f = fftwf_plan_dft_2d(height, width, data_in, fft, FFTW_FORWARD, FFTW_MEASURE);
for(int i = 0, k = 0; i < height; ++i) {
    float* row = I.ptr<float>(i);
    for(int j = 0; j < width; j++) {
        data_in[k][0] = (float)row[j];
        data_in[k][1] = (float)0.0;
        k++;
    }
}
fftwf_execute(plan_f);
int width2 = 2*width;
// writing output matrix: RealFFT[0], ImaginaryFFT[0], RealFFT[1], ImaginaryFFT[1], ...
for(i = 0, k = 0; i < height; i++) {
    for(j = 0; j < width2; j++) {
        outdata[i*width2 + j] = (float)fft[k][0];
        outdata[i*width2 + j+1] = (float)fft[k][1];
        j++;
        k++;
    }
}
Mat fft_I(height, width, CV_32FC2, outdata);
fftwf_destroy_plan(plan_f);
fftwf_free(data_in);
fftwf_free(fft);
return fft_I;
Your FFT time with FFTW seems very high. To get the best out of FFTW with fixed-size FFTs, you should generate a plan using the FFTW_PATIENT flag and then ideally save the generated "wisdom" for subsequent re-use. You can generate wisdom either from your own code or using the fftw-wisdom tool.
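A minimal sketch of that plan/wisdom workflow (the wisdom file name and the function name are arbitrary examples; error handling omitted):
#include <fftw3.h>

fftwf_plan make_plan(int height, int width,
                     fftwf_complex* in, fftwf_complex* out)
{
    // load previously accumulated wisdom, if any (a no-op when the file is absent)
    fftwf_import_wisdom_from_filename("fftw_wisdom.dat");

    // FFTW_PATIENT searches much harder for a fast plan than FFTW_MEASURE;
    // planning is slow the first time, but essentially free once wisdom exists
    fftwf_plan p = fftwf_plan_dft_2d(height, width, in, out,
                                     FFTW_FORWARD, FFTW_PATIENT);

    // persist the planning results for subsequent runs
    fftwf_export_wisdom_to_filename("fftw_wisdom.dat");
    return p;
}
Also, since your input is purely real, a real-to-complex plan (fftwf_plan_dft_r2c_2d) exploits Hermitian symmetry and roughly halves the work and memory traffic compared with the complex-to-complex transform in your code.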
The FFT from the Intel Math Kernel Library (separate from the Intel compiler) is faster than FFTW most of the time. I don't know if it will be enough of an improvement in your case to justify the price, though.
I will agree with the others that rolling your own FFT is probably not a good use of your time (unless you want to learn how to do it). The available FFT implementations (FFTW, MKL) have been finely tuned over many years. I'm not saying that you can't do better, but it would probably be a lot of work and time for marginal gains.
Believe me, FFTW is really very optimized; there is very little chance that you can do better.
Which compiler did you use to compile FFTW? Sometimes the compiler from Intel gives better performance than gcc.