Efficient 2D FFT of fixed length real input data in C/C++ - c++

I'm developing an algorithm that calls an FFT function several times. I have tight time constraints (real-time operation is the goal), so I need to minimize the time spent in every FFT call.
I'm working with the OpenCV library and have already implemented my code with two different approaches:
Using the FFTW library. Data/memory management + FFT (8 ms) = 14 ms (on average, FFTW_MEASURE flag).
Using the OpenCV fft function. Data/memory management + FFT (21 ms) = 23 ms (on average).
Since my input is always a real 512x512-pixel image, could I get better performance by implementing the FFT myself from the mathematical definition of the DFT, with precomputed sine/cosine tables, or is the FFTW library already too well optimized for that? Any better ideas?
All ideas and suggestions are appreciated. For now, I am not considering parallelization or a GPU implementation.
Thank you
Update:
System: Intel Xeon 5130 2.0GHz CPU in Windows 7, Visual Studio 10.0 and FFTW 3.3.3 (compiled following instructions in the site), OpenCV 2.4.3.
Code example for the FFT call with FFTW (input: OpenCV Mat CV_32F (1 channel, float type); output: OpenCV Mat CV_32FC2 (2 channels, float type)):
float *im_data;
fftwf_complex *data_in;
fftwf_complex *fft;
fftwf_plan plan_f;
int i, j, k;
int height=I.rows;
int width=I.cols;
int N=height*width;
float* outdata = new float[2*N];
im_data = ( float* ) I.data;
data_in = ( fftwf_complex* )fftwf_malloc( sizeof( fftwf_complex ) * N );
fft = ( fftwf_complex* )fftwf_malloc( sizeof( fftwf_complex ) * N );
plan_f = fftwf_plan_dft_2d( height , width , data_in , fft , FFTW_FORWARD , FFTW_MEASURE );
for(int i = 0,k=0; i < height; ++i) {
float* row = I.ptr<float>(i);
for(int j = 0; j < width; j++) {
data_in[k][0]=(float)row[j];
data_in[k][1] =(float)0.0;
k++;
}
}
fftwf_execute( plan_f );
// writing output matrix: RealFFT[0],ImaginaryFFT[0],RealFFT[1],ImaginaryFFT[1],...
for( k = 0 ; k < N ; k++ ) {
outdata[2 * k] = ( float )fft[k][0];
outdata[2 * k + 1] = ( float )fft[k][1];
}
// A Mat built on a user pointer does not own the buffer, so clone it
// into a Mat that manages its own memory and release the temporary.
Mat fft_I = Mat(height,width,CV_32FC2,outdata).clone();
delete[] outdata;
fftwf_destroy_plan( plan_f );
fftwf_free( data_in );
fftwf_free( fft );
return fft_I;

Your FFT time with FFTW seems very high. To get the best out of FFTW with fixed-size FFTs, you should generate a plan using the FFTW_PATIENT flag and then ideally save the generated "wisdom" for subsequent re-use. You can generate wisdom either from your own code or using the fftw-wisdom tool.
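A further saving may come from the input being real: the DFT of a real signal is conjugate-symmetric, so roughly half of the complex output is redundant. FFTW exposes this through its real-to-complex planners (for the float build, fftwf_plan_dft_r2c_2d), which store only height x (width/2 + 1) complex values. The sketch below does not use FFTW. It is a deliberately naive O(N^2) 1D DFT in standard C++, with hypothetical helper names (naive_dft, max_symmetry_error), meant only to illustrate the symmetry those planners exploit:

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) DFT of a real signal. Illustration only, not a fast transform.
std::vector<std::complex<double>> naive_dft(const std::vector<double>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> X(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = -2.0 * pi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += x[t] * std::polar(1.0, angle);
        }
        X[k] = sum;
    }
    return X;
}

// Largest deviation from the conjugate symmetry X[n-k] == conj(X[k]), k = 1..n-1.
double max_symmetry_error(const std::vector<std::complex<double>>& X) {
    const std::size_t n = X.size();
    double worst = 0.0;
    for (std::size_t k = 1; k < n; ++k) {
        worst = std::max(worst, std::abs(X[n - k] - std::conj(X[k])));
    }
    return worst;
}
```

For a real input such as {1, 2, ..., 8}, max_symmetry_error(naive_dft(x)) stays at rounding-error level. Whether a real-to-complex plan actually reduces the 14 ms total in this case would have to be measured.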

The FFT from the Intel Math Kernel Library (separate from the Intel compiler) is faster than FFTW most of the time. I don't know if it will be enough of an improvement in your case to justify the price though.
I agree with the others that rolling your own FFT is probably not a good use of your time (unless you want to learn how to do it). The available FFT implementations (FFTW, MKL) have been finely tuned over many years. I'm not saying that you can't do better, but it would probably be a lot of work and time for marginal gains.

Believe me, FFTW is really very well optimized; there is very little chance that you can do better.
Which compiler did you use to compile FFTW? Sometimes the Intel compiler gives better performance than gcc.

Related

Is there a way to use "unified memory" (MAGMA) with 2 GPU cards with NVLink and 1TB RAM

At work, on Debian 10, I have two RTX A6000 GPU cards connected by an NVLink hardware bridge, plus 1 TB of RAM, and I would like to benefit from the combined power of both cards and the 1 TB of RAM.
Currently, I have the following magma.make invoked by a Makefile :
CXX = nvcc -std=c++17 -O3
LAPACK = /opt/intel/oneapi/mkl/latest
LAPACK_ANOTHER=/opt/intel/mkl/lib/intel64
MAGMA = /usr/local/magma
INCLUDE_CUDA=/usr/local/cuda/include
LIBCUDA=/usr/local/cuda/lib64
SEARCH_DIRS_INCL=-I${MAGMA}/include -I${INCLUDE_CUDA} -I${LAPACK}/include
SEARCH_DIRS_LINK=-L${LAPACK}/lib/intel64 -L${LAPACK_ANOTHER} -L${LIBCUDA} -L${MAGMA}/lib
CXXFLAGS = -c -DMAGMA_ILP64 -DMKL_ILP64 -m64 ${SEARCH_DIRS_INCL}
LDFLAGS = ${SEARCH_DIRS_LINK} -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lcuda -lcudart -lcublas -lmagma -lpthread -lm -ldl
SOURCES = main_magma.cpp XSAF_C_magma.cpp
EXECUTABLE = main_magma.exe
When I execute my code, I get memory errors, because the code tries to invert matrices of size 120k x 120k.
Looking closer, a 120k x 120k matrix in double precision requires 120k x 120k x 8 bytes, i.e. about 115 GB (roughly 107 GiB).
The functions involved can't accept single precision.
Unfortunately, my 2 NVIDIA GPU cards have only 48 GB each.
Question:
Is there a way, from a computational or a coding point of view, to merge the memory of the 2 GPU cards (which would give 96 GB) in order to invert these large matrices?
I am using MAGMA, with an inversion routine like this:
// ROUTINE MAGMA IMPLEMENTED
void matrix_inverse_magma(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
// Index for loop and arrays
int i, j, ip, idx;
// Start magma part
magma_int_t m = F_matrix.size();
if (m) {
magma_init (); // initialize Magma
magma_queue_t queue=NULL;
magma_int_t dev=0;
magma_queue_create(dev ,&queue );
double gpu_time , *dwork; // dwork - workspace
magma_int_t ldwork; // size of dwork
magma_int_t *piv, info; // piv - array of pivot indices (row interchanges)
magma_int_t mm=m*m; // size of a, r, c
double *a; // a- mxm matrix on the host
double *d_a; // d_a - mxm matrix a on the device
double *d_c; // d_c - mxm matrix c on the device
magma_int_t ione = 1;
magma_int_t ISEED [4] = { 0,0,0,1 }; // seed
magma_int_t err;
const double alpha = 1.0; // alpha =1
const double beta = 0.0; // beta=0
ldwork = m * magma_get_dgetri_nb( m ); // optimal block size
// allocate matrices
err = magma_dmalloc_cpu( &a , mm ); // host memory for a
for (i = 0; i<m; i++){
for (j = 0; j<m; j++){
idx = i*m + j;
a[idx] = F_matrix[i][j];
//cout << "a[" << idx << "]" << a[idx] << endl;
}
}
err = magma_dmalloc( &d_a , mm ); // device memory for a
err = magma_dmalloc( &dwork , ldwork );// dev. mem. for ldwork
piv=( magma_int_t *) malloc(m*sizeof(magma_int_t ));// host mem.
magma_dsetmatrix( m, m, a, m, d_a, m, queue); // copy a -> d_a
magma_dgetrf_gpu( m, m, d_a, m, piv, &info);
magma_dgetri_gpu(m, d_a, m, piv, dwork, ldwork, &info);
magma_dgetmatrix( m, m, d_a , m, a, m, queue); // copy d_a ->a
for (i = 0; i<m; i++){
for (j = 0; j<m; j++){
idx = i*m + j;
F_output[i][j] = a[idx];
}
}
// SAVE ORIGINAL
magma_free_cpu(a); // free host memory allocated by magma_dmalloc_cpu
free(piv); // free host memory
magma_free(d_a); // free device memory
magma_free(dwork); // free device workspace
magma_queue_destroy(queue); // destroy queue
magma_finalize ();
// End magma part
}
}
If it is not possible to do this directly with the NVLink hardware bridge between the two GPU cards, what workaround would allow this matrix inversion?
Edit :
I was told by an HPC engineer:
"The easiest way will be to use the Makefiles until we figure out how
cmake can support that. If you do that, you can just replace
LAPACKE_dgetrf by magma_dgetrf. MAGMA will use internally one GPU with
out-of-memory algorithm that will factor the matrix, even if it is
large and does not fit into the memory of the GPU."
Does this mean that I have to find the appropriate Makefile flags to be able to use magma_dgetrf instead of LAPACKE_dgetrf?
And regarding the second sentence, which says
"MAGMA will use internally one GPU with out-of-memory algorithm that
will factor the matrix"
does it mean that if my matrix is over 48 GB, MAGMA will be able to spill the rest into the second A6000 GPU or into RAM and still perform the inversion of the full matrix?
Please let me know which flags to use to build MAGMA correctly in my case.
Currently, I do:
$ mkdir build && cd build
$ cmake -DUSE_FORTRAN=ON \
-DGPU_TARGET=Ampere \
-DLAPACK_LIBRARIES="/opt/intel/oneapi/intelpython/latest/lib/liblapack.so" \
-DMAGMA_ENABLE_CUDA=ON ..
$ cmake --build . --config Release
I am not an expert in GPGPU computation, but I would be very surprised if you could combine two compute devices into a single device. At least I don't think it's possible using a standard library. If you think about it, it sort of defeats the purpose of using a GPU in the first place.
However, I would say that once you use very large matrices you hit many problems that make a textbook inverse operation numerically unstable. The normal way around this is never to store an inverse matrix at all. Often you only require an inverse matrix to be able to solve
Ax = b (solve for x)
Ax - b = 0 (the same system, written as a zero residual)
which can be solved without ever forming the inverse of A.
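As a small illustration of solving Ax = b without forming the inverse, here is a minimal sketch of Gaussian elimination with partial pivoting in standard C++. The function name solve_linear is hypothetical, and this dense CPU version is only meant to show the idea at toy scale; for real workloads the equivalent LAPACK-style routine is dgesv (LU factorization followed by triangular solves).

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Solve A x = b by Gaussian elimination with partial pivoting.
// A and b are taken by value because they are overwritten during elimination.
std::vector<double> solve_linear(std::vector<std::vector<double>> A, std::vector<double> b) {
    const std::size_t n = A.size();
    for (std::size_t col = 0; col < n; ++col) {
        // Pick the row with the largest entry in this column as the pivot.
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r) {
            if (std::fabs(A[r][col]) > std::fabs(A[pivot][col])) pivot = r;
        }
        if (A[pivot][col] == 0.0) throw std::runtime_error("matrix is singular");
        std::swap(A[col], A[pivot]);
        std::swap(b[col], b[pivot]);
        // Eliminate the entries below the pivot.
        for (std::size_t r = col + 1; r < n; ++r) {
            const double f = A[r][col] / A[col][col];
            for (std::size_t c = col; c < n; ++c) A[r][c] -= f * A[col][c];
            b[r] -= f * b[col];
        }
    }
    // Back substitution on the resulting upper-triangular system.
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = b[i];
        for (std::size_t c = i + 1; c < n; ++c) s -= A[i][c] * x[c];
        x[i] = s / A[i][i];
    }
    return x;
}
```

For A = {{2, 1}, {1, 3}} and b = {3, 5}, this gives x = (0.8, 1.4) up to rounding. Only the factorized matrix and the right-hand side are ever stored, never an explicit inverse.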
I would suggest that you need to start by reading the inverse-matrix chapter of Numerical Recipes in C/C++. This is a standard text, with example code, and is widely available from Amazon, etc. These texts assume CPU implementation, but...
Once you understand these algorithms, you may (or may not) find that being able to issue two parallel non-inverse matrix operations is useful to you. However the algorithms described in this (and other texts) are orders of magnitude faster than any brute force operation anyway.

Procedural Landscape Generation from Data (BP, C++)

In UE4 we want to create procedural landscapes from freely available geological and height data. We've been following the book "Unreal Engine 4 Scripting with C++ Cookbook", which is a little dated. The code, adapted accordingly, works well until it comes to updating the landscape. It crashes at:
int32 numHeights = (rect.Width()+1)*(rect.Height()+1);
TArray<uint16> Data;
Data.Init( 0, numHeights );
for( int i = 0; i < Data.Num(); i++ ) {
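float nx = (float)(i % cols) / cols; // normalized x value (cast first: with integer cols the division would truncate to 0)
float ny = (float)(i / cols) / rows; // normalized y value (same cast, assuming rows is an integer too)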
Data[i] = GeoHeightData( nx, ny, 16, 4, 4 );
}
LandscapeEditorUtils::SetHeightmapData( landscape, Data );
The function
LandscapeEditorUtils::SetHeightmapData( landscape, Data );
no longer exists. In LandscapeEdit.h you can find the
LandscapeEdit::SetHeightData
This is defined by
SetHeightData(InMinX, InMinY, InMaxX, InMaxY, (uint16*)ImportHeightData->GetData(), 0, false, nullptr);
Is this function the equivalent of SetHeightmapData? The engine crashes with this approach too.
Do you have any suggestions or workarounds for creating procedural landscapes, either from blueprint or code? We also checked out the approach of Christian Sparks (https://hippowombat.tumblr.com/post/...-ue4-420#notes), which is cool, but we need the landscape for runtime applications.
Thx!
reiti

Best way to indexing a matrix in opencv

Let's say A and B are matrices of the same size.
In Matlab, I could use simple indexing as below.
idx = A>0;
B(idx) = 0
How can I do this in OpenCV? Should I just use
for (i=0; ... rows)
for(j=0; ... cols)
if (A.at<double>(i,j)>0) B.at<double>(i,j) = 0;
something like this? Is there a better (faster and more efficient) way?
Moreover, in OpenCV, when I try
Mat idx = A>0;
the variable idx seems to be a CV_8U matrix (not boolean but integer).
You can easily convert this MATLAB code:
idx = A > 0;
B(idx) = 0;
// same as
B(A>0) = 0;
to OpenCV as:
Mat1d A(...);
Mat1d B(...);
Mat1b idx = A > 0;
B.setTo(0, idx);
// or
B.setTo(0, A > 0);
Regarding performance, in C++ it's usually faster (it depends on the enabled optimizations) to work on raw pointers (but is less readable):
for (int r = 0; r < B.rows; ++r)
{
double* pA = A.ptr<double>(r);
double* pB = B.ptr<double>(r);
for (int c = 0; c < B.cols; ++c)
{
if (pA[c] > 0.0) pB[c] = 0.0;
}
}
Also note that in OpenCV there isn't any boolean matrix, but it's a CV_8UC1 matrix (aka a single channel matrix of unsigned char), where 0 means false, and any value >0 is true (typically 255).
Evaluation
Note that this may vary according to optimization enabled with OpenCV. You can test the code below on your PC to get accurate results.
Time in ms:
            my results             my results        @AdrienDescamps
            (OpenCV 3.0 No IPP)    (OpenCV 2.4.9)
Matlab  :   13.473
C++ Mask:   640.824                5.81815           ~5
C++ Loop:   5.24414                4.95127           ~4
Note: I'm not entirely sure about the performance drop with OpenCV 3.0, so I just remark: test the code below on your PC to get accurate results.
As @AdrienDescamps stated in comments:
It seems that the performance drop with OpenCV 3.0 is related to the OpenCL option, that is now enabled in the comparison operator.
C++ Code
#include <opencv2/opencv.hpp>
#include <iostream>
using namespace std;
using namespace cv;
int main()
{
// Random initialize A with values in [-100, 100]
Mat1d A(1000, 1000);
randu(A, Scalar(-100), Scalar(100));
// B initialized with some constant (5) value
Mat1d B(A.rows, A.cols, 5.0);
// Operation: B(A>0) = 0;
{
// Using mask
double tic = double(getTickCount());
B.setTo(0, A > 0);
double toc = (double(getTickCount()) - tic) * 1000 / getTickFrequency();
cout << "Mask: " << toc << endl;
}
{
// Using for loop
double tic = double(getTickCount());
for (int r = 0; r < B.rows; ++r)
{
double* pA = A.ptr<double>(r);
double* pB = B.ptr<double>(r);
for (int c = 0; c < B.cols; ++c)
{
if (pA[c] > 0.0) pB[c] = 0.0;
}
}
double toc = (double(getTickCount()) - tic) * 1000 / getTickFrequency();
cout << "Loop: " << toc << endl;
}
getchar();
return 0;
}
Matlab Code
% Random initialize A with values in [-100, 100]
A = (rand(1000) * 200) - 100;
% B initialized with some constant (5) value
B = ones(1000) * 5;
tic
B(A>0) = 0;
toc
UPDATE
OpenCV 3.0 uses IPP optimization in the function setTo. If you have that enabled (you can check with cv::getBuildInformation()), you'll have a faster computation.
Miki's answer is very good, but I just want to add some clarification about the performance problem to avoid any confusion.
It is true that the best way to implement an image filter (or any algorithm) with OpenCV is to use the raw pointers, as shown in the second C++ example of Miki (C++ Loop).
Using the at function is also correct, but significantly slower.
However, most of the time, you don't need to worry about that, and you can simply use the high level functions of OpenCV (first example of Miki , C++ Mask). They are well optimized, and will usually be almost as fast as a low level loop on pointers, or even faster.
Of course, there are exceptions (we just found one), and you should always test for your specific problem.
Now, regarding this specific problem :
The example here where the high level function was much slower (100x slower) than the low level loop is NOT a normal case, as it is demonstrated by the timings with other version/configuration of OpenCV, that are much lower.
The problem seems to be that when OpenCV3.0 is compiled with OpenCL, there is a huge overhead the first time a function that uses OpenCL is called. The simplest solution is to disable OpenCL at compile time, if you use OpenCV3.0 (see also here for other possible solutions if you are interested).

how to fix slow kmeans of opencv

I use kmeans() for a bag-of-words project and it takes a lot of time: with 600 images it takes 40-50 minutes. I looked at the source code, and this part takes most of the time:
for( i = 0; i < N; i++ )///very very slow part because N*K is huge
{
sample = data.ptr<float>(i);
int k_best = 0;
double min_dist = DBL_MAX;
for( k = 0; k < K; k++ )
{
const float* center = centers.ptr<float>(k);
double dist = normL2Sqr_(sample, center, dims);
if( min_dist > dist )
{
min_dist = dist;
k_best = k;
}
}
compactness += min_dist;
labels[i] = k_best;
}
I tried but could not manage to speed up that part. Is there a way to make it more efficient? The loop takes 22-23 seconds, which makes the whole program take 40-50 minutes to finish, so I can't try other video or image sets. A better k-means implementation in C++ would help too, as would a way to reduce N (the number of features); K is the dictionary size, so I can't reduce it. Thanks in advance for any help.
The k-means implementation in OpenCV is very inefficient, and there are a number of tricks to improve performance that it does not use. It would be considerable work to rewrite it yourself.
The implementation in VLFeat offers better algorithms for k-means, but I don't know about the quality of the implementation.
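The answer does not say which tricks it has in mind. One simple candidate, offered here only as an assumption about what might help, is partial distance search: while accumulating the squared distance to a center, give up on that center as soon as the running sum already reaches the best distance found so far. The sketch below is standalone standard C++ with a hypothetical function name (nearest_center); it is not OpenCV code, and any speed-up depends on the data and on the dimensionality.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Index of the center closest to `sample` (squared Euclidean distance).
// A candidate center is abandoned once its partial sum can no longer win.
// Assumes every center has the same number of dimensions as the sample.
std::size_t nearest_center(const std::vector<float>& sample,
                           const std::vector<std::vector<float>>& centers) {
    std::size_t best = 0;
    double min_dist = std::numeric_limits<double>::max();
    for (std::size_t k = 0; k < centers.size(); ++k) {
        double dist = 0.0;
        for (std::size_t d = 0; d < sample.size(); ++d) {
            const double diff = static_cast<double>(sample[d]) - centers[k][d];
            dist += diff * diff;
            if (dist >= min_dist) break;  // cannot beat the current best
        }
        if (dist < min_dist) {
            min_dist = dist;
            best = k;
        }
    }
    return best;
}
```

The result is the same as the exhaustive inner loop in the question, including ties going to the lowest index; only the amount of arithmetic per rejected center changes.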

Controlling the index variables in C++ AMP

I have just started trying C++ AMP and I decided to give it a shot with the current project I am working on. At some point, I have to build a distance matrix for the vectors I have and I have written the code below for this
unsigned int samplesize=samplelist.size();
unsigned int vs = samplelist.front().size();
vector<double> samplevec(samplesize*vs);
vector<double> distancevec(samplesize*samplesize,0);
it1=samplelist.begin();
for(int i=0 ; i<samplesize; ++i){
for(int j = 0 ; j<vs ; ++j){
samplevec[j + i*vs] = (*it1)[j];
}
++it1;
}
array_view<const double,2> samplearray(samplesize,vs,samplevec);
array_view<writeonly<double>,2> distances(samplesize,samplesize,distancevec);
parallel_for_each(distances.grid, [=](index<2> idx) restrict(direct3d){
double sqrsum=0;
double tempd=0;
for ( unsigned int i=0 ; i<vs ; ++i)
{
tempd = samplearray(idx.x,i) - samplearray(idx.y,i);
sqrsum += tempd*tempd;
}
distances[idx]=sqrsum;
});
However, as you can see, this does not take into account the symmetry of distance matrices. Once I have calculated sqrsum for vectors i and j, I don't want to repeat the same calculation with i and j reversed. Is there any way to accomplish this? I came up with the following trick, but I don't know whether it would improve performance significantly:
for ( unsigned int i=0 ; i<vs ; ++i)
{
if(idx.x<=idx.y){
break;
}
tempd = samplearray(idx.x,i) - samplearray(idx.y,i);
sqrsum += tempd*tempd;
}
Can the if-condition do the job? Or do you think the if statement would hurt performance unnecessarily? I couldn't come up with any alternative to it.
BTW, I just noticed that the code above does not work on my machine, whose GPU only supports single precision. Is there any way to get around that problem? The error message is as follows:
"runtime_exception: Concurrency::parallel_for_each uses features unsupported by the selected accelerator.
ID3D11Device::CreateComputeShader: Shader uses double precision float ops which are not supported on the current device."
I think you can eliminate the if-condition by scheduling only as many threads as you need, instead of scheduling the entire rectangle that covers your output matrix. What you need is the upper or lower triangle without the diagonal, whose size you can calculate as an arithmetic series.
The alternative would be to organize the input data into two 1D vectors; each thread would read a value from vector 1, then from vector 2, calculate the distance, and store it in one of the input vectors.
Finally, the double-precision error shows up because the card you are using does not support double-precision operations. Please check your card's specification to confirm that. You can work around it by switching to a single-precision type, i.e. "float", in the array_view template.
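To make the first suggestion concrete, here is a minimal sketch in plain standard C++ (not C++ AMP) of how a linear thread index could be mapped onto the strict lower triangle. The names pair_count and pair_from_linear are hypothetical. For n samples there are n(n-1)/2 pairs, the sum of the arithmetic series 1 + 2 + ... + (n-1), and thread k handles the pair (i, j) with j < i and k = i(i-1)/2 + j.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>

// Number of unordered pairs (i, j) with j < i < n: 1 + 2 + ... + (n - 1).
std::size_t pair_count(std::size_t n) {
    return n * (n - 1) / 2;
}

// Map a linear index k to the pair (i, j), j < i, such that k == i*(i-1)/2 + j.
std::pair<std::size_t, std::size_t> pair_from_linear(std::size_t k) {
    // Closed-form estimate of the row from the quadratic i*(i-1)/2 <= k.
    std::size_t i = static_cast<std::size_t>(
        (1.0 + std::sqrt(1.0 + 8.0 * static_cast<double>(k))) / 2.0);
    // Correct for possible floating-point rounding in either direction.
    while (i * (i - 1) / 2 > k) --i;
    while ((i + 1) * i / 2 <= k) ++i;
    const std::size_t j = k - i * (i - 1) / 2;
    assert(j < i);
    return {i, j};
}
```

Each thread would then compute the distance for its pair once and write it to both (i, j) and (j, i). Indices 0, 1, 2, 3 map to (1,0), (2,0), (2,1), (3,0). Whether the extra index arithmetic pays off compared with the early break in the question would need to be measured on the target device.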