Optimization tips for CUDA code - C++

I wrote a piece of code for computing the Self Quotient Image (SQI) in MATLAB, and now I want to rewrite part of it in parallel for a speedup.
This is the relevant part:
siz = 15;
X = normalize8(X);
[a,b] = size(X);
filt = fspecial('gaussian',[siz siz],sigma);
padsize = floor(siz/2);
padX = padarray(X,[padsize, padsize],'symmetric','both');
t0 = tic; % -------------------------------------------------------------
Z = zeros(a,b);
for i = padsize+1:a+padsize
    for j = padsize+1:b+padsize
        region = padX(i-padsize:i+padsize, j-padsize:j+padsize);
        means = mean(region(:));
        M = return_step(region, means);
        filt1 = filt.*M;
        summ = sum(sum(filt1));
        filt1 = (filt1/summ);
        Z(i-padsize,j-padsize) = (sum(sum(filt1.*region))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------
and the return_step function:
function M = return_step(X, means)
[a,b] = size(X);
M = zeros(a,b); % preallocate so M always has the size of X and entries below the mean are 0
for i = 1:a
    for j = 1:b
        if X(i,j) >= means
            M(i,j) = 1;
        end
    end
end
I wrote the kernel function below:
__global__ void returnstep(const double* x, double* m, double* filt, int leng, double mean, int i, int j, int width)
{
    int idx = threadIdx.y*blockDim.x + threadIdx.x;
    if (idx >= leng) return;
    int ridx = (j + threadIdx.y)*width + threadIdx.x + i;
    double xval = x[ridx];
    if (xval >= mean) m[idx] = filt[idx]*xval;
    else              m[idx] = 0;
}
and then changed the MATLAB code as follows:
kernel = parallel.gpu.CUDAKernel('returnstep.ptx', 'returnstep.cu');
kernel.ThreadBlockSize = [double(siz) double(siz) 1];
GM = gpuArray(zeros(siz,siz));
GpadX = gpuArray(padX);
Gfilt = gpuArray(filt);
%% Process image
t0 = tic; % -------------------------------------------------------------
Z = zeros(a,b);
for i = padsize+1:a+padsize
    for j = padsize+1:b+padsize
        region = padX(i-padsize:i+padsize, j-padsize:j+padsize); % still computed on the CPU
        means = mean(region(:));
        GM = feval(kernel, GpadX, GM, Gfilt, siz*siz, means, i-padsize-1, j-padsize-1, padXwidth);
        filt1 = gather(GM);
        summ = sum(sum(filt1));
        filt1 = (filt1/summ);
        Z(i-padsize,j-padsize) = (sum(sum(filt1))/(siz*siz));
    end
end
toc(t0) % -------------------------------------------------------------
My sequential code runs in 2.5 s for a 330x200 image, but the new parallel code takes 15 s, and I don't understand why.
I need some advice for improving it. I am new to CUDA programming.

> help gather
...
X = GATHER(A) when A is a GPUArray, X is an array in the local workspace
with the data transferred from the GPU device.
....
filt1 = gather(GM) copies GM from the GPU back to the CPU on every iteration, which is very inefficient. You should move the entire computation inside the loop body to the GPU, or preferably move the whole loop nest into the kernel. Otherwise you can forget about any speedup.
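For illustration, a minimal sketch of what moving the whole loop nest into one kernel could look like: one thread per output pixel computes the window mean, the thresholded filter, and the normalized sum in a single launch. The kernel name, launch geometry and bounds checks are my assumptions, not the poster's code; the indexing follows MATLAB's column-major gpuArray layout.
__global__ void sqi_step(const double* padX, const double* filt, double* Z,
                         int a, int b, int padsize, int siz)
{
    // hypothetical kernel: one thread per output pixel
    int row = blockIdx.y*blockDim.y + threadIdx.y; // output row
    int col = blockIdx.x*blockDim.x + threadIdx.x; // output column
    if (row >= a || col >= b) return;

    int height = a + 2*padsize; // height of the padded image (column-major)

    // Mean of the siz-by-siz window around this output pixel
    double mn = 0.0;
    for (int v = 0; v < siz; ++v)
        for (int u = 0; u < siz; ++u)
            mn += padX[(col + v)*height + (row + u)];
    mn /= (double)(siz*siz);

    // Threshold the Gaussian weights against the mean, normalize, accumulate
    double wsum = 0.0, acc = 0.0;
    for (int v = 0; v < siz; ++v)
        for (int u = 0; u < siz; ++u) {
            double xval = padX[(col + v)*height + (row + u)];
            double w = (xval >= mn) ? filt[v*siz + u] : 0.0;
            wsum += w;
            acc  += w*xval;
        }
    Z[col*a + row] = (acc/wsum)/(double)(siz*siz);
}
A single feval of such a kernel, with for example a 16x16 ThreadBlockSize and a GridSize of ceil(b/16)-by-ceil(a/16), would then replace the entire MATLAB loop nest and all the per-pixel gathers.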

My evaluation with a Sobel filter shows that the CPU outperforms the GPU on small images. I think your image is simply too small for a meaningful CPU-GPU comparison: the computation must be large enough to hide the kernel launch and communication overhead.

Related

How can I make it faster in C++11 with std::vector?

I have cv::Mat Mat_A and cv::Mat Mat_B, both (800000 x 512) floats,
and the code below looks slow.
int rows = Mat_B.rows;
cv::Mat Mat_A = cv::repeat(img, rows, 1); // the 4-argument overload of cv::repeat returns void and cannot be assigned
Mat_A = Mat_A - Mat_B;
cv::pow(Mat_A, 2, Mat_A);
cv::reduce(Mat_A, Mat_A, 1, CV_REDUCE_SUM);
cv::minMaxLoc(Mat_A, &dis, 0, &point, 0);
How can I do this with std::vector?
I think it should be faster.
On my 2.4 GHz MacBook Pro it takes 4 seconds, which is very slow.
I don't think you should use std::vector for these operations. Image processing (CV, i.e. computer vision) algorithms tend to be computationally heavy because there is so much data to deal with, and OpenCV 2.0's C++ API is highly optimized for exactly these kinds of operations. For example, cv::Mat has a header, and whenever a cv::Mat is copied with copy assignment or the copy constructor, only the header and a pointer to the data are copied; reference counting keeps track of the instances. So memory management is done for you, and that's a good thing. A short illustration follows the link below.
https://docs.opencv.org/2.4/doc/tutorials/core/mat_the_basic_image_container/mat_the_basic_image_container.html
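A tiny sketch of that header-copy point (values are hypothetical, just to show the sharing semantics):
#include <opencv2/core/core.hpp>
#include <cassert>

int main()
{
    cv::Mat a = cv::Mat::zeros(4, 4, CV_32F);
    cv::Mat b = a;          // shallow copy: b shares a's pixel data
    cv::Mat c = a.clone();  // deep copy: c owns its own pixel data
    b.at<float>(0, 0) = 1.f;
    assert(a.at<float>(0, 0) == 1.f); // a sees the change made through b
    assert(c.at<float>(0, 0) == 0.f); // c is unaffected
    return 0;
}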
You could try to compile without debug symbols, i.e. release instead of debug, and to compile with optimization flags, e.g. -O3 for gcc, which should reduce the size of your binary and speed up runtime operations. Maybe it will make a difference.
https://www.rapidtables.com/code/linux/gcc/gcc-o.html
Another thing you could try is giving your process a higher priority: the higher the priority, the less the process yields the CPU. Again, that might not make a lot of difference; it all depends on the other processes and their priorities, etc.
https://superuser.com/questions/42817/is-there-any-way-to-set-the-priority-of-a-process-in-mac-os-x
I hope that helps a bit.
Well, your thinking is wrong.
Why your program is slow:
Your CPU has to loop through a lot of numbers and do the arithmetic, so the computational cost is high. That's why it's slow. Your program's speed is proportional to the size of Mat A and B; you can check this by reducing or increasing their size.
Can we accelerate it with std::vector?
Sorry, but no. Using std::vector will not reduce the computational cost. OpenCV's matrix arithmetic is about as good as it gets; rewriting it yourself will only lead to slower code.
How to accelerate the calculation: you need to enable the acceleration options for OpenCV.
You can see them at https://github.com/opencv/opencv/wiki/CPU-optimizations-build-options . Intel provides the Intel MKL library to accelerate matrix calculations; you could try that first.
Personally, the easiest approach is to use the GPU, but your machine doesn't have one, so that's out of scope here.
You keep iterating over the data again and again to perform independent operations on it.
Something like the following iterates over the data only once.
// assumes Mat_B and img are cv::Mat
using px_t = float; // you mentioned float so I'll assume both img and Mat_B use floats
int rows = Mat_B.rows;
cv::Mat output(1, rows, Mat_B.type());
auto output_ptr = output.ptr<px_t>(0);
auto img_ptr = img.ptr<px_t>(0);
int min_idx = 0;
int max_idx = 0;
px_t min_ele = std::numeric_limits<px_t>::max();
px_t max_ele = std::numeric_limits<px_t>::lowest(); // min() would be the smallest positive float, not the most negative
for(int i = 0; i < rows; ++i)
{
    output_ptr[i] = 0;
    auto mat_row = Mat_B.ptr<px_t>(i);
    for(int j = 0; j < Mat_B.cols; ++j)
    {
        output_ptr[i] += (img_ptr[j]-mat_row[j])*(img_ptr[j]-mat_row[j]);
    }
    if(output_ptr[i] < min_ele)
    {
        min_idx = i;
        min_ele = output_ptr[i];
    }
    if(output_ptr[i] > max_ele)
    {
        max_idx = i;
        max_ele = output_ptr[i];
    }
}
While I am also not sure whether it is faster, you can do this, assuming Mat_B contains uchar:
std::vector<uchar> array_B(Mat_B.rows * Mat_B.cols);
if(Mat_B.isContinuous())
    array_B.assign(Mat_B.data, Mat_B.data + Mat_B.total()); // copy the raw buffer; assigning the pointer directly would not compile

Realtime audio application, improving performance

I am currently writing a C++ real-time audio application which roughly contains:
reading frames from a buffer
interpolating frames with the Hermite interpolation here
filtering every frame with two biquad filters (and updating their coefficients every frame)
a 3-band crossover containing 18 biquad calculations
a FreeVerb algorithm from the STK library here
I think my PC should be able to handle this, but I get buffer underflows every so often, so I would like to improve the performance of my application. I have a bunch of questions I hope you can answer. :)
1) Operator Overloading
Instead of working directly with my float samples and doing the calculations per sample, I pack my floats into a Frame class which contains the left and the right sample. The class overloads some operators for addition, subtraction and multiplication with float.
The filters (mostly biquads) and the reverb work with floats and don't use this class, but the Hermite interpolator and every multiplication and addition for volume control and mixing do.
Does this have an impact on performance, and would it be better to work with the left and right samples directly?
2) std::function
The callback function from the audio I/O library PortAudio calls a std::function. I use this to encapsulate everything related to PortAudio, so the "user" sets his own callback function with std::bind:
std::bind( &AudioController::processAudio,
           &(*this),
           std::placeholders::_1,
           std::placeholders::_2));
Since for every callback the right function has to be looked up by the CPU (however exactly that works...), does this have an impact, and would it be better to define a class the user has to inherit from? A rough sketch of that alternative is shown below.
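For reference, the inheritance alternative mentioned above could look roughly like this (the names are illustrative, not from the actual code):
struct AudioCallback
{
    virtual ~AudioCallback() = default;
    // one virtual call per buffer instead of invoking a std::function
    virtual void processAudio(int frameCount, float *output) = 0;
};

class AudioController : public AudioCallback
{
public:
    void processAudio(int frameCount, float *output) override;
};
// The PortAudio wrapper would then store an AudioCallback* set by the user.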
3) Virtual functions
I use a class called AudioProcessor which declares a virtual function:
virtual void tick(Frame *buffer, int frameCount) = 0;
This function always processes a number of frames at once; depending on the driver, 200 up to 1000 frames per call.
Within the signal processing path, I call this function 6 times on several derived classes. I remember that this dispatch works through a lookup table (the vtable), so the CPU knows exactly which function it has to call. Does calling a virtual function on a derived class like this have an impact on performance?
The nice thing about it is the structure in the source code, but maybe using inline functions instead would bring a performance improvement.
These are all my questions for now. I have some more about Qt's event loop, because I think my GUI uses quite a bit of CPU time as well, but that is another topic I guess. :)
Thanks in advance!
These are all the relevant function calls within the signal processing; some of them are from the STK library.
The biquad functions are from STK and should perform fine, and the same goes for the FreeVerb algorithm.
// ################################ AudioController Function ############################
void AudioController::processAudio(int frameCount, float *output) {
    // CALCULATE LEFT TRACK
    Frame *leftFrameBuffer = (Frame*) output;
    if(leftLoaded) { // the left processor is loaded
        leftProcessor->tick(leftFrameBuffer, frameCount); // TrackProcessor::tick()
    } else {
        for(int i = 0; i < frameCount; i++) {
            leftFrameBuffer[i].leftSample = 0.0f;
            leftFrameBuffer[i].rightSample = 0.0f;
        }
    }
    // CALCULATE RIGHT TRACK
    if(rightLoaded) { // the right processor is loaded
        // rightFrameBuffer is allocated once and guaranteed large enough for frameCount Frames
        rightProcessor->tick(rightFrameBuffer, frameCount); // TrackProcessor::tick()
    } else {
        for(int i = 0; i < frameCount; i++) {
            rightFrameBuffer[i].leftSample = 0.0f;
            rightFrameBuffer[i].rightSample = 0.0f;
        }
    }
    // MIX
    for(int i = 0; i < frameCount; i++ ) {
        leftFrameBuffer[i] = volume * (leftRightMix * leftFrameBuffer[i] + (1.0 - leftRightMix) * rightFrameBuffer[i]);
    }
}
// ################################ TrackProcessor Function ############################
void TrackProcessor::tick(Frame *frames, int frameNum) {
    if(bufferLoaded && playback) {
        for(int i = 0; i < frameNum; i++) {
            // read from buffer
            frames[i] = bufferPlayer->tick();
            // filter coeffs
            caltulateFilterCoeffs(lowCutoffFilter->tick(), highCutoffFilter->tick());
            // filter
            frames[i].leftSample = lpFilterL->tick(hpFilterL->tick(frames[i].leftSample));
            frames[i].rightSample = lpFilterR->tick(hpFilterR->tick(frames[i].rightSample));
        }
    } else {
        for(int i = 0; i < frameNum; i++) {
            frames[i] = Frame(0,0);
        }
    }
    // Effect 1, Equalizer
    if(effsActive[0]) {
        insEffProcessors[0]->tick(frames, frameNum);
    }
    // Effect 2, Reverb
    if(effsActive[1]) {
        insEffProcessors[1]->tick(frames, frameNum);
    }
    // Volume
    for(int i = 0; i < frameNum; i++) {
        frames[i].leftSample *= volume;
        frames[i].rightSample *= volume;
    }
}
// ################################ Equalizer ############################
void EqualizerProcessor::tick(Frame *frames, int frameNum) {
    if(active) {
        Frame lowCross;
        Frame highCross;
        for(int f = 0; f < frameNum; f++) {
            lowAmp = lowAmpFilter->tick();
            midAmp = midAmpFilter->tick();
            highAmp = highAmpFilter->tick();
            lowCross = highLPF->tick(frames[f]);
            highCross = highHPF->tick(frames[f]);
            frames[f] = lowAmp * lowLPF->tick(lowCross)
                      + midAmp * lowHPF->tick(lowCross)
                      + highAmp * lowAPF->tick(highCross);
        }
    }
}
// ################################ Reverb ############################
// This function just calls the stk::FreeVerb tick function for every frame.
// The FreeVerb implementation can't really be optimised, so I will take it as it is.
void ReverbProcessor::tick(Frame *frames, int frameNum) {
    if(active) {
        for(int i = 0; i < frameNum; i++) {
            frames[i].leftSample = reverb->tick(frames[i].leftSample, frames[i].rightSample);
            frames[i].rightSample = reverb->lastOut(1);
        }
    }
}
// ################################ Buffer Playback (BufferPlayer) ############################
Frame BufferPlayer::tick() {
    // adjust read position based on loop status
    if(inLoop) {
        while(readPos > loopEndPos) {
            readPos = loopStartPos + (readPos - loopEndPos);
        }
    }
    int x1 = readPos;
    float t = readPos - x1;
    Frame f = interpolate(buffer->frameAt(x1-1),
                          buffer->frameAt(x1),
                          buffer->frameAt(x1+1),
                          buffer->frameAt(x1+2),
                          t);
    readPos += stepSize;
    return f;
}
// interpolation:
Frame BufferPlayer::interpolate(Frame x0, Frame x1, Frame x2, Frame x3, float t) {
    Frame c0 = x1;
    Frame c1 = 0.5f * (x2 - x0);
    Frame c2 = x0 - (2.5f * x1) + (2.0f * x2) - (0.5f * x3);
    Frame c3 = (0.5f * (x3 - x0)) + (1.5f * (x1 - x2));
    return (((((c3 * t) + c2) * t) + c1) * t) + c0;
}
inline Frame BufferPlayer::frameAt(int pos) {
    if(pos < 0) {
        pos = 0;
    } else if (pos >= frames) {
        pos = frames - 1;
    }
    // get chunk and relative sample
    int chunk = pos / ChunkSize;
    int chunkSample = pos % ChunkSize;
    return Frame(leftChunks[chunk][chunkSample], rightChunks[chunk][chunkSample]);
}
Some suggestions for performance improvement:
Optimize Data Cache Usage
Review your functions that operate on a lot of data (e.g. arrays). A function should load data into the cache, operate on it, then store it back to memory.
Organize the data to best fit into the data cache, and break it into smaller blocks if it doesn't fit. Search the web for "data-driven design" and "cache optimization".
In one project, performing data smoothing, I changed the layout of the data and gained 70% performance; a rough sketch of that kind of layout change follows below.
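As a hedged sketch of what such a layout change could look like for this application (my own types, not taken from the poster's code): splitting the interleaved Frame array into one contiguous array per channel lets each pass stream linearly through memory.
#include <vector>

struct FramesAoS                 // current layout: array of structs
{
    struct Frame { float left, right; };
    std::vector<Frame> frames;
};

struct FramesSoA                 // cache-friendlier: struct of arrays
{
    std::vector<float> left;
    std::vector<float> right;
};

// Volume scaling over the SoA layout touches one tight float stream per
// channel, which compilers can usually auto-vectorize.
void applyVolume(FramesSoA &buf, float volume)
{
    for (float &s : buf.left)  s *= volume;
    for (float &s : buf.right) s *= volume;
}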
Use Multiple Threads
In the big picture, you may be able to use at least three dedicated threads: input, processing and output. The input thread obtains the data and stores it in buffer(s); search the Web for "double buffering". The second thread gets data from the input buffer, processes it, then writes to an output buffer. The third thread writes data from the output buffer to the file.
You may also benefit from using threads for left and right samples. For example, while one thread is processing the left sample, another thread could be processing the right sample. If you could put the threads on different cores, you may see even more performance benefit.
Use GPU Processing
Many modern graphics processing units (GPUs) have a lot of cores that can process floating-point values. Maybe you could delegate some of the filtering or analysis functions to the cores of the GPU. Be aware that this incurs overhead, and to gain a benefit the processing part must be computationally heavier than that overhead.
Reduce Branching
Processors prefer to manipulate data rather than branch. Branching stalls execution while the processor figures out where to fetch and process the next instruction. Some processors have instruction caches large enough to hold small loops, but there is still a penalty for branching back to the top of the loop; search for "loop unrolling" (a small sketch follows below). Also check your compiler's optimization settings and optimize for performance; many compilers will unroll loops for you when the circumstances are right.
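A small sketch of manual unrolling in the spirit of this advice (illustrative only; a good compiler at -O3 often does this for you):
// Process four samples per iteration: one loop branch per 4 samples
void scaleUnrolled(float *s, int n, float volume)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s[i]     *= volume;
        s[i + 1] *= volume;
        s[i + 2] *= volume;
        s[i + 3] *= volume;
    }
    for (; i < n; ++i) // remainder
        s[i] *= volume;
}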
Reduce the Amount of Processing
Do you need to process the entire sample, or only portions of it? For example, in video processing much of the frame doesn't change, only small portions, so the entire frame doesn't need to be processed. Can the audio channels be isolated so that only a few channels are processed rather than the entire spectrum?
Coding to Help the Compiler Optimize
You can help the compiler with optimizations by using the const modifier. The compiler may be able to use different strategies for variables that don't change versus ones that do; for example, a const value can be placed in the executable itself, while a non-const value must live in memory.
Using static and const together can help too. static usually implies a single instance, and const implies something that doesn't change, so if there is only one instance of a variable that never changes, the compiler can place it into the executable or into read-only memory and optimize the surrounding code harder; a small example follows below.
Loading multiple variables at the same time can also help: the processor can place the data into the cache, and the compiler may be able to use specialized instructions for fetching sequential data.
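A minimal illustration of the const/static point (a hypothetical table, not from the poster's code):
// A table that never changes: static const lets the compiler place it in
// read-only storage and fold accesses where possible.
static const float kMixGains[4] = {0.25f, 0.5f, 0.75f, 1.0f};

inline float mixGain(int step)
{
    return kMixGains[step & 3]; // indexes read-only data, no writable state
}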

FFTW vs Matlab FFT

I posted this on MATLAB Central but didn't get any responses, so I figured I'd repost here.
I recently wrote a simple routine in MATLAB that uses an FFT in a for-loop; the FFT dominates the calculations. I wrote the same routine as a mex file, just for experimentation purposes, calling the FFTW 3.3 library. It turns out that the MATLAB routine runs faster than the mex routine for very large arrays (about twice as fast). The mex routine uses wisdom and performs the same FFT calculations. I also know MATLAB uses FFTW, but is it possible their version is slightly more optimized? I even used the FFTW_EXHAUSTIVE flag, and it's still about twice as slow for large arrays than the MATLAB counterpart. Furthermore, I ensured the MATLAB I used was single-threaded with the "-singleCompThread" flag, and the mex file was not built in debug mode. Just curious whether this is expected, or whether there are optimizations MATLAB uses under the hood that I don't know about. Thanks.
Here's the mex portion:
void class_cg_toeplitz::analysis() {
    // This method computes CG iterations using FFTs
    // Check for wisdom
    if(fftw_import_wisdom_from_filename("cd.wis") == 0) {
        mexPrintf("wisdom not loaded.\n");
    } else {
        mexPrintf("wisdom loaded.\n");
    }
    // Set FFTW plans - use interleaved FFTW
    fftw_plan plan_forward_d_buffer;
    fftw_plan plan_forward_A_vec;
    fftw_plan plan_backward_Ad_buffer;
    fftw_complex *A_vec_fft;
    fftw_complex *d_buffer_fft;
    A_vec_fft = fftw_alloc_complex(n);
    d_buffer_fft = fftw_alloc_complex(n);
    // CREATE MASTER PLAN - do this on an empty vector, as creating a plan
    // with FFTW_MEASURE will erase the contents;
    // use d_buffer
    // This is somewhat dangerous because Ad_buffer is a vector; but it does not
    // get resized, so &Ad_buffer[0] should work
    plan_forward_d_buffer = fftw_plan_dft_r2c_1d(d_buffer.size(),&d_buffer[0],d_buffer_fft,FFTW_EXHAUSTIVE);
    plan_forward_A_vec = fftw_plan_dft_r2c_1d(A_vec.height,A_vec.value,A_vec_fft,FFTW_WISDOM_ONLY);
    // A_vec_fft.*d_buffer_fft will overwrite d_buffer_fft
    plan_backward_Ad_buffer = fftw_plan_dft_c2r_1d(Ad_buffer.size(),d_buffer_fft,&Ad_buffer[0],FFTW_EXHAUSTIVE);
    // Get A_vec_fft
    fftw_execute(plan_forward_A_vec);
    // Find initial direction - this is the initial residual
    for (int i=0;i<n;i++) {
        d_buffer[i] = b.value[i];
        r_buffer[i] = b.value[i];
    }
    // Start CG iterations
    norm_ro = norm(r_buffer);
    double fft_reduction = (double)Ad_buffer.size(); // must divide by the vector size because the inverse FFT does not
    while (norm(r_buffer)/norm_ro > relativeresidual_cutoff) {
        // Find Ad - use FFT
        fftw_execute(plan_forward_d_buffer);
        // Get A_vec_fft.*fft(d) - A_vec_fft is real only, but d_buffer_fft
        // has complex elements; overwrite d_buffer_fft
        for (int i=0;i<n;i++) {
            d_buffer_fft[i][0] = d_buffer_fft[i][0]*A_vec_fft[i][0]/fft_reduction;
            d_buffer_fft[i][1] = d_buffer_fft[i][1]*A_vec_fft[i][0]/fft_reduction;
        }
        fftw_execute(plan_backward_Ad_buffer);
        // Calculate r'*r
        rtr_buffer = 0;
        for (int i=0;i<n;i++) {
            rtr_buffer = rtr_buffer + r_buffer[i]*r_buffer[i];
        }
        // Calculate alpha
        alpha = 0;
        for (int i=0;i<n;i++) {
            alpha = alpha + d_buffer[i]*Ad_buffer[i];
        }
        alpha = rtr_buffer/alpha;
        // Calculate new x
        for (int i=0;i<n;i++) {
            x[i] = x[i] + alpha*d_buffer[i];
        }
        // Calculate new residual
        for (int i=0;i<n;i++) {
            r_buffer[i] = r_buffer[i] - alpha*Ad_buffer[i];
        }
        // Calculate beta
        beta = 0;
        for (int i=0;i<n;i++) {
            beta = beta + r_buffer[i]*r_buffer[i];
        }
        beta = beta/rtr_buffer;
        // Calculate new direction vector
        for (int i=0;i<n;i++) {
            d_buffer[i] = r_buffer[i] + beta*d_buffer[i];
        }
        *total_counter = *total_counter+1;
        if(*total_counter >= iteration_cutoff) {
            // Set total_counter to -1; this indicates failure
            *total_counter = -1;
            break;
        }
    }
    // Store wisdom
    fftw_export_wisdom_to_filename("cd.wis");
    // Free FFT-allocated memory and plans
    fftw_destroy_plan(plan_forward_d_buffer);
    fftw_destroy_plan(plan_forward_A_vec);
    fftw_destroy_plan(plan_backward_Ad_buffer);
    fftw_free(A_vec_fft);
    fftw_free(d_buffer_fft);
};
Here's the MATLAB portion:
% Take FFT of A_vec.
A_vec_fft = fft(A_vec); % Take fft once
% Find initial direction - this is the initial residual
x = zeros(n,1); % search direction
r = zeros(n,1); % residual
d = zeros(n+(n-2),1); % search direction; pad to allow FFT
for i = 1:n
    d(i) = b(i);
    r(i) = b(i);
end
% Enter CG iterations
total_counter = 0;
rtr_buffer = 0;
alpha = 0;
beta = 0;
Ad_buffer = zeros(n+(n-2),1); % holds the product A*d - calculated once per iteration using the FFT; only 1:n is used
norm_ro = norm(r);
while(norm(r)/norm_ro > 10^-6)
    % Find Ad - use fft
    Ad_buffer = ifft(A_vec_fft.*fft(d));
    % Calculate rtr_buffer
    rtr_buffer = r'*r;
    % Calculate alpha
    alpha = rtr_buffer/(d(1:n)'*Ad_buffer(1:n));
    % Calculate new x
    x = x + alpha*d(1:n);
    % Calculate new residual
    r = r - alpha*Ad_buffer(1:n);
    % Calculate beta
    beta = r'*r/(rtr_buffer);
    % Calculate new direction vector
    d(1:n) = r + beta*d(1:n);
    % Update counter
    total_counter = total_counter+1;
end
In terms of time, for N = 50000 and b = 1:n, it takes about 10.5 seconds with mex and 4.4 seconds with MATLAB. I'm using R2011b. Thanks.
A few observations rather than a definitive answer, since I do not know the specifics of the MATLAB FFT implementation:
Based on the code you have, I can see two explanations for the speed difference:
the speed difference is explained by differences in the level of optimization of the FFT
the while loop in MATLAB is executed a significantly smaller number of times
I will assume you already looked into the second issue and that the numbers of iterations are comparable. (If they aren't, it is most likely due to accuracy issues and worth further investigation.)
Now, regarding FFT speed comparison:
Yes, in theory FFTW is faster than other high-level FFT implementations, but that is only relevant as long as you compare apples to apples: here you are comparing implementations a level further down, at the assembly level, where not only the selection of the algorithm but its actual optimization for a specific processor, by developers with varying skills, comes into play.
I have optimized or reviewed optimized FFTs in assembly on many processors over the years (I was in the benchmarking industry), and great algorithms are only part of the story. There are considerations that are very specific to the architecture you are coding for (accounting for latencies, scheduling of instructions, optimization of register usage, arrangement of data in memory, accounting for branch taken/not-taken latencies, etc.) that make differences as important as the selection of the algorithm.
With N = 500000, we are also talking about large memory buffers: yet another door for more optimizations that can quickly get very specific to the platform you run your code on. How well you manage to avoid cache misses is dictated not so much by the algorithm as by how the data flows and what optimizations the developer used to bring data in and out of memory efficiently.
Though I do not know the details of the MATLAB FFT implementation, I am pretty sure that an army of DSP engineers has been (and still is) honing its optimization, as it is key to so many designs. This could very well mean that MATLAB had the right combination of developers to produce a much faster FFT.
This is a classic performance gain thanks to low-level, architecture-specific optimization.
MATLAB uses the FFT from the Intel MKL (Math Kernel Library) binary (mkl.dll). These are routines optimized at the assembly level by Intel for Intel processors; even on AMD CPUs they seem to give nice performance boosts.
FFTW is a plain C library that is not as heavily optimized, hence the performance gain from using the MKL.
I have found the following comment on the MathWorks website [1]:
Note on large powers of 2: For FFT dimensions that are powers of
2, between 2^14 and 2^22, MATLAB software uses special preloaded
information in its internal database to optimize the FFT computation.
No tuning is performed when the dimension of the FFT is a power of 2,
unless you clear the database using the command fftw('wisdom', []).
Although it relates to powers of 2, it may hint that MATLAB employs its own 'special wisdom' when using FFTW for certain (large) array sizes. Consider: 2^16 = 65536.
[1] R2013b Documentation available from http://www.mathworks.de/de/help/matlab/ref/fftw.html (accessed on 29 Oct 2013)
EDIT: @wakjah's reply to this answer is accurate: FFTW does support split real and imaginary memory storage via its Guru interface. My claim about hacking is thus not accurate, but it can very well apply if FFTW's Guru interface is not used, which is the case by default, so beware still!
First, sorry for being a year late. I'm not convinced that the speed increase you see comes from MKL or other optimizations. There is something quite fundamentally different between FFTW and MATLAB, and that is how complex data is stored in memory.
In MATLAB, the real and imaginary parts of a complex vector X are separate arrays Xre[i] and Xim[i] (linear in memory, efficient when operating on either of them separately).
In FFTW, the real and imaginary parts are interleaved as double[2] by default, i.e. X[i][0] is the real part and X[i][1] is the imaginary part.
Thus, to use the FFTW library in mex files, one cannot use the MATLAB array directly; one must first allocate new memory, then pack the input from MATLAB into FFTW format, and then unpack the output from FFTW back into MATLAB format, i.e.
X = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
Y = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
then
for (size_t i=0; i<N; ++i) {
    X[i][0] = Xre[i];
    X[i][1] = Xim[i];
}
then
for (size_t i=0; i<N; ++i) {
    Yre[i] = Y[i][0];
    Yim[i] = Y[i][1];
}
Hence, this requires 2x memory allocations + 4x reads + 4x writes, all of size N. This does take a toll speed-wise on large problems.
I have a hunch that MathWorks may have hacked the FFTW3 code to let it read input vectors directly in the MATLAB format, which avoids all of the above.
In this scenario, one can allocate only X and run FFTW in-place with X as the output (fftw_plan_*(N, X, X, ...) instead of fftw_plan_*(N, X, Y, ...)), since the result is copied into the Yre and Yim MATLAB vectors anyway, unless the application requires/benefits from keeping X and Y separate.
EDIT: Watching memory consumption in real time when running MATLAB's fft2() versus my code based on the fftw3 library shows that MATLAB allocates only one additional complex array (the output), whereas my code needs two such arrays (the *fftw_complex buffer plus the MATLAB output). An in-place conversion between the MATLAB and FFTW formats is not possible, because MATLAB's real and imaginary arrays are not consecutive in memory. This suggests that MathWorks hacked the fftw3 library to read/write the data using the MATLAB format.
One other optimization for multiple calls is to allocate persistently (using mexMakeMemoryPersistent()). I'm not sure whether the MATLAB implementation does this as well.
Cheers.
p.s. As a side note, the MATLAB complex data storage format is more efficient for operating on the real or imaginary vectors separately; with FFTW's interleaved format you'd have to do twice the memory reads.
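To make wakjah's point concrete, here is a minimal sketch of planning directly on split (MATLAB-style) real/imaginary arrays via FFTW's Guru interface; the wrapper function and its arguments are my own illustration, not code from the question:
#include <fftw3.h>

// Plan a 1-D complex DFT that reads/writes separate real and imaginary
// arrays (Xre/Xim in, Yre/Yim out): no packing into fftw_complex needed.
fftw_plan make_split_plan(int n, double *Xre, double *Xim,
                          double *Yre, double *Yim)
{
    fftw_iodim dim;
    dim.n  = n; // transform length
    dim.is = 1; // input stride: contiguous
    dim.os = 1; // output stride: contiguous
    return fftw_plan_guru_split_dft(1, &dim, 0, NULL,
                                    Xre, Xim, Yre, Yim, FFTW_ESTIMATE);
}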

CUDA Thrust slow when operating on large vectors on my machine

I'm a CUDA beginner reading some Thrust tutorials. I wrote a simple but terribly organized piece of code to try to figure out the speedup Thrust gives (is this idea correct?). I try to add two vectors (each with 10000000 ints) into another vector: once as plain arrays on the CPU, and once as device_vectors on the GPU.
Here is the thing:
#include <iostream>
#include <ctime> // for clock()
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#define N 10000000
int main(void)
{
    float time_cpu;
    float time_gpu;
    int *a = new int[N];
    int *b = new int[N];
    int *c = new int[N];
    for(int i=0;i<N;i++)
    {
        a[i]=i;
        b[i]=i*i;
    }
    clock_t start_cpu,stop_cpu;
    start_cpu=clock();
    for(int i=0;i<N;i++)
    {
        c[i]=a[i]+b[i];
    }
    stop_cpu=clock();
    time_cpu=(double)(stop_cpu-start_cpu)/CLOCKS_PER_SEC*1000;
    std::cout<<"Time to generate (CPU):"<<time_cpu<<std::endl;
    thrust::device_vector<int> X(N);
    thrust::device_vector<int> Y(N);
    thrust::device_vector<int> Z(N);
    for(int i=0;i<N;i++)
    {
        X[i]=i;
        Y[i]=i*i;
    }
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);
    thrust::transform(X.begin(), X.end(),
                      Y.begin(),
                      Z.begin(),
                      thrust::plus<int>());
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime,start,stop);
    std::cout<<"Time to generate (thrust):"<<elapsedTime<<std::endl;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    getchar();
    return 0;
}
The CPU results appear really fast, but the GPU runs REALLY slowly on my machine (i5-2320, 4 GB, GTX 560 Ti): the CPU time is about 26, while the GPU time is around 30! Did I just use Thrust wrongly, with some stupid error in my code, or is there a deeper reason?
As a C++ rookie, I checked my code over and over and still got a slower time on the GPU with Thrust, so I did some experiments to show the difference between calculating vectorAdd with five different approaches.
I use the Windows API QueryPerformanceFrequency()/QueryPerformanceCounter() as the unified time measurement method.
Each of the experiments looks like this:
f = large_interger.QuadPart;
QueryPerformanceCounter(&large_interger);
c1 = large_interger.QuadPart;
for(int j=0;j<10;j++)
{
    for(int i=0;i<N;i++)//CPU array adding
    {
        c[i]=a[i]+b[i];
    }
}
QueryPerformanceCounter(&large_interger);
c2 = large_interger.QuadPart;
printf("Time to generate (CPU array adding) %lf ms\n", (c2 - c1) * 1000 / f);
and here is my simple __global__ function for GPU array adding:
__global__ void add(int *a, int *b, int *c)
{
    int tid=threadIdx.x+blockIdx.x*blockDim.x;
    while(tid<N)
    {
        c[tid]=a[tid]+b[tid];
        tid+=blockDim.x*gridDim.x;
    }
}
and the function is called as:
for(int j=0;j<10;j++)
{
    add<<<(N+127)/128,128>>>(dev_a,dev_b,dev_c);//GPU array adding
}
I add vectors a[N] and b[N] into vector c[N], in a loop of 10 iterations, by:
adding arrays on the CPU
adding std::vector on the CPU
adding thrust::host_vector on the CPU
adding thrust::device_vector on the GPU
adding arrays on the GPU
With N=10000000, I get the following results:
CPU array adding 268.992968ms
CPU std::vector adding 1908.013595ms
CPU Thrust::host_vector adding 10776.456803ms
GPU Thrust::device_vector adding 297.156610ms
GPU array adding 5.210573ms
This confused me. I'm not familiar with the implementation of template libraries; does performance really differ so much between containers and raw data structures?
Most of the execution time is being spent in your loop that is initializing X[i] and Y[i]. While this is legal, it's a very slow way to initialize large device vectors. It would be better to create host vectors, initialize them, then copy those to the device. As a test, modify your code like this (right after the loop where you are initializing the device vectors X[i] and Y[i]):
} // this is your line of code
std::cout<< "Starting GPU run" <<std::endl; //add this line
cudaEvent_t start, stop; //this is your line of code
You will then see that the GPU timing results appear almost immediately after that added line prints out. So all of the time you're waiting is spent in initializing those device vectors directly from host code.
When I run this on my laptop, I get a CPU time of about 40 and a GPU time of about 5, so the GPU is running about 8 times faster than the CPU for the sections of code you are actually timing.
If you create X and Y as host vectors, and then create analogous d_X and d_Y device vectors, the overall execution time will be shorter, like so:
thrust::host_vector<int> X(N);
thrust::host_vector<int> Y(N);
thrust::device_vector<int> Z(N);
for(int i=0;i<N;i++)
{
    X[i]=i;
    Y[i]=i*i;
}
thrust::device_vector<int> d_X = X;
thrust::device_vector<int> d_Y = Y;
and change your transform call to:
thrust::transform(d_X.begin(), d_X.end(),
                  d_Y.begin(),
                  Z.begin(),
                  thrust::plus<int>());
OK, so you've now indicated that the CPU measurement is faster than the GPU measurement; sorry I jumped to conclusions. My laptop is an HP with a 2.6 GHz Core i7 and a Quadro 1000M GPU, running CentOS 6.2 Linux. A few comments: if you're running any heavy display tasks on your GPU, that can detract from performance. Also, when benchmarking these things it's common practice to use the same mechanism for both measurements; you can use cudaEvents for both if you want, since it can time CPU code the same as GPU code. It's also common practice with thrust to do an untimed warm-up run first, then repeat the test for the measurement, and likewise to run the test 10 times or more in a loop and divide to get an average; a sketch of that pattern follows below. In my case, I can tell the clock() measurement is pretty coarse, because successive runs give me 30, 40 or 50, while on the GPU measurement I get something like 5.18256. Some of these things may help, but I can't say exactly why your results and mine differ so much (on the GPU side).
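A sketch of that warm-up-and-average pattern, reusing the X, Y, Z vectors and the start/stop events from the code in the question (the loop structure is mine):
// Untimed warm-up so one-time initialization doesn't pollute the numbers
thrust::transform(X.begin(), X.end(), Y.begin(), Z.begin(),
                  thrust::plus<int>());

const int runs = 10;
cudaEventRecord(start, 0);
for (int r = 0; r < runs; ++r)
    thrust::transform(X.begin(), X.end(), Y.begin(), Z.begin(),
                      thrust::plus<int>());
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
std::cout << "average per run: " << (ms / runs) << " ms" << std::endl;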
OK, I did another experiment. The compiler makes a big difference on the CPU side: compiled with the -O3 switch, the CPU time dropped to 0. Then I converted the CPU timing from the clock() method to cudaEvents and got a CPU measured time of 12.4 (with -O3 optimization), still 5.1 on the GPU side.
Your mileage will vary based on the timing method and which compiler you are using on the CPU side.
First, Y[i]=i*i does not fit in an integer for 10M elements: a 32-bit int holds values only up to about 2.1e9, while your code needs values up to 1e14.
Second, the timing of the transform looks correct, and it should be faster than the CPU regardless of which library you're using. Robert's suggestion to initialize the vectors on the CPU and then transfer them to the GPU is a good one for this case.
Third, since we can't do the integer multiply, below is some simpler CUDA library code (using ArrayFire, which I work on) that does something similar with floats, for your benchmarking:
int n = 10e6;
array x = array(seq(n));
array y = x * x;
timer t = timer::tic();
array z = x + y;
af::eval(z); af::sync();
printf("elapsed seconds: %g\n", timer::toc( t));
Good luck!
I ran a similar test recently using CUDA Thrust on my Quadro 1000M. I used thrust::sort_by_key as a benchmark to test its performance, and the result was too good to convince my boss: it takes 100+ ms to sort 512 MB of pairs.
For your problem, I am confused about 2 things:
(1) Why do you multiply time_cpu by 1000? Without the 1000 it is already in seconds; with it, you are reporting milliseconds.
time_cpu=(double)(stop_cpu-start_cpu)/CLOCKS_PER_SEC*1000;
(2) And when you mention 26, 30, 40, do you mean seconds or ms? cudaEvent reports elapsed time in ms, not s.

Help with code optimization

I've written a little particle system for my 2D application. Here is the rain code:
// HPP -----------------------------------
struct Data
{
    float x, y, x_speed, y_speed;
    int timeout;
    Data();
};
std::vector<Data> mData;
bool mFirstTime;
void processDrops(float windPower, int i);
// CPP -----------------------------------
Data::Data()
    : x(rand()%ScreenResolutionX), y(0)
    , x_speed(0), y_speed(0), timeout(rand()%130)
{ }
void Rain::processDrops(float windPower, int i)
{
    int posX = rand() % mWindowWidth;
    mData[i].x = posX;
    mData[i].x_speed = WindPower*0.1; // WindPower is float
    mData[i].y_speed = Gravity*0.1; // Gravity is 9.8 * 19.2
    // On the first run, distribute drops randomly over the window height
    if (mFirstTime)
    {
        mData[i].timeout = 0;
        mData[i].y = rand() % mWindowHeight;
    }
    else
    {
        mData[i].timeout = rand() % 130;
        mData[i].y = 0;
    }
}
void update(float windPower, float elapsed)
{
    // If this is the first time - create the array of Data objects
    if (mFirstTime)
    {
        for (int i=0; i < mMaxObjects; ++i)
        {
            mData.push_back(Data());
            processDrops(windPower, i);
        }
        mFirstTime = false;
    }
    for (int i=0; i < mMaxObjects; i++)
    {
        // Sleep until timeout reaches 0 (so drops start falling with a random delay)
        if (mData[i].timeout > 0)
        {
            mData[i].timeout--;
        }
        else
        {
            // Find new x/y positions
            mData[i].x += mData[i].x_speed * elapsed;
            mData[i].y += mData[i].y_speed * elapsed;
            // Find new speeds
            mData[i].x_speed += windPower * elapsed;
            mData[i].y_speed += Gravity * elapsed;
            // Drawing here ...
            // If the drop has fallen out of the screen
            if (mData[i].y > mWindowHeight) processDrops(windPower, i);
        }
    }
}
So the main idea is: I have a structure holding a drop's position and speed, and a function for (re)initializing the drop at a given index in the vector. On the first run I build the array at its maximum size and then process it in a loop.
But this code runs slower than everything else I have. Please help me optimize it.
I tried replacing all int with uint16_t, but I don't think it matters.
Replacing int with uint16_t shouldn't make any difference (it'll take less memory, but shouldn't affect running time on most machines).
The code shown already seems pretty fast (it does only what it needs to do, and there are no obvious mistakes); I don't see how you could optimize it much further (at most you could remove the check on mFirstTime, but that should make no difference).
If it's slow, it's because of something else. Maybe you've got too many drops, or the rest of your code is so slow that update gets called only a few times per second.
I'd suggest you profile your program and see where most of the time is spent.
EDIT:
One thing that could speed up such an algorithm, especially on a system without an FPU (which is not the case on a personal computer...), would be to replace your floating-point values with integers:
just multiply the elapsed variable (and your constants, like those 0.1) by 1000 so that they represent milliseconds, and use only integers everywhere.
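A hedged sketch of that fixed-point idea, with units of my own choosing (positions in millipixels, time in milliseconds; none of this is the poster's code):
struct DropFixed
{
    int x_mpx, y_mpx;     // position in 1/1000 pixel
    int x_speed, y_speed; // speed in millipixels per millisecond
    int timeout;
};

// wind_accel and gravity_accel are in millipixels per millisecond^2
void stepDrop(DropFixed &d, int elapsed_ms, int wind_accel, int gravity_accel)
{
    d.x_mpx   += d.x_speed * elapsed_ms; // integer multiply-add only, no FPU
    d.y_mpx   += d.y_speed * elapsed_ms;
    d.x_speed += wind_accel * elapsed_ms;
    d.y_speed += gravity_accel * elapsed_ms;
}
// Convert back to pixels only when drawing: x_px = d.x_mpx / 1000.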
A few points:
The physics is incorrect: the wind force should decrease as the drop's speed approaches the wind speed; for simplicity I would also assume that the initial value of x_speed is the wind speed itself.
You don't account for friction with the air at all, so the drops get faster and faster, but that depends on what you want to model.
I would simply assume that a drop falls at constant speed in a constant direction, because this is what really happens, very quickly. You can then optimize all of this very simply, since you don't need to solve the equation of motion by integration; it has a direct solution:
x(t) := x_0 + wind_speed * t
y(t) := y_0 - fall_speed * t
This is the case of stable fall, when the gravity force is balanced by friction. If you want to model drops that fall faster and faster:
x(t) := x_0 + wind_speed * t
y(t) := y_0 - 0.5 * g * t^2
A few things to consider:
In your processDrops function, you pass in windPower but use some sort of class member or global called WindPower; is that a typo? If the value of Gravity does not change, then compute Gravity*0.1 once and use that directly.
In your update function, rather than calculating windPower * elapsed and Gravity * elapsed on every iteration, calculate and save them before the loop, then just add. Also, reorganise the loop: there is no need to do the speed calculation and render if the drop is out of the screen; do the check first, and only if the drop is still on screen update the speed and render!
Interestingly, you never check whether the drop is out of the screen in its x coordinate; you check the height but not the width. You could save yourself some calculations and rendering time by doing this check as well.
In the loop, introduce a reference Data& current = mData[i] and use it instead of mData[i]; use such a reference instead of the index in processDrops as well (a minimal illustration follows below).
BTW, I thought that consulting mFirstTime in processDrops serves no purpose because it would never be true there; but no, I missed the processDrops call in the initialization loop. Never mind.
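A minimal illustration of the reference suggestion, using the same fields as the Data struct above:
// inside update(): fetch the element once, then work through the reference
Data& drop = mData[i];
if (drop.timeout > 0)
{
    drop.timeout--;
}
else
{
    drop.x += drop.x_speed * elapsed;
    drop.y += drop.y_speed * elapsed;
    drop.x_speed += windPower * elapsed;
    drop.y_speed += Gravity * elapsed;
}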
This looks pretty fast to me already.
You could get a tiny speedup by removing the "first time" code and putting it in its own function, called once, rather than testing on every call.
You are doing the same calculation on lots of similar data, so you could look into using SSE intrinsics to process several items at once. You'd likely have to rearrange your data structure for that, though, into a structure of vectors rather than a vector of structures as it is now. I doubt it would help much; how many items are in your vector, anyway?
It looks like maybe all your time goes into "... Drawing here".
It's easy enough to find out for sure where the time is going.