I just learned about streams in CUDA and tried them out. However, the result is not what I wanted: the streams do not run in parallel. (GPU: Tesla M6; OS: Red Hat Enterprise Linux 8.)
I have a data matrix with size (5,2048), and a kernel to process the matrix.
My plan is to split the data into nStreams = 4 sectors and use 4 streams to run the kernel on them in parallel.
Part of my code is like the following:
int rows = 5;
int cols = 2048;
int blockSize = 32;
int gridSize = (rows*cols) / blockSize;
dim3 block(blockSize);
dim3 grid(gridSize);
int nStreams = 4; // preparation for streams
cudaStream_t *streams = (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));
for(int ii=0;ii<nStreams;ii++){
checkCudaErrors(cudaStreamCreate(&streams[ii]));
}
int streamSize = rows * cols / nStreams;
dim3 streamGrid = streamSize/blockSize;
for(int jj=0;jj<nStreams;jj++){
int offset = jj * streamSize;
Mykernel<<<streamGrid,block,0,streams[jj]>>>(&d_Data[offset],streamSize);
} // d_Data is the matrix on gpu
The Visual Profiler shows that the 4 streams do not run in parallel. Stream 13 starts first and stream 16 last. There is a gap of 12.378 µs between stream 13 and stream 14, and each kernel execution lasts around 5 µs. In the 'Runtime API' row above them, it shows 'cudaLaunch'.
Could you give me some advice? Thanks!
(I don't know how to upload pictures in stackoverflow, so I just describe the result in words.)
First of all, there is no guarantee that work launched in separate streams will actually be executed on the GPU in parallel. As pointed out in the programming guide, using multiple streams merely opens up the possibility; you cannot rely on it actually happening. It's up to the driver to decide.
Apart from that, your Tesla M6 has 12 multiprocessors if I'm not mistaken. Each of these 12 Maxwell multiprocessors can hold a maximum of 32 resident blocks. This brings the total maximum number of blocks resident on the entire device to 384. You're launching 320 blocks of 32 threads each. That alone doesn't leave all that much space and you're probably using more than 32 registers per thread so the GPU will be quite full with a single one of these launches, which is most likely why the driver chooses not to run another kernel in parallel.
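The block-count arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch: the hardware figures are the ones quoted in the answer (not queried from a device), and the function names are made up for illustration. It also shows that each per-stream launch in the question is 80 blocks, so the figure of 320 blocks corresponds to all four launches together.

```python
# Figures quoted in the answer above (assumptions, not device queries).
SM_COUNT = 12            # multiprocessors on a Tesla M6
MAX_BLOCKS_PER_SM = 32   # resident-block limit per Maxwell multiprocessor


def resident_block_capacity(sm_count=SM_COUNT, blocks_per_sm=MAX_BLOCKS_PER_SM):
    """Upper bound on the number of blocks resident on the device at once."""
    return sm_count * blocks_per_sm


def blocks_per_launch(rows, cols, n_streams, block_size):
    """Grid size of one per-stream launch, computed as in the question."""
    stream_size = rows * cols // n_streams
    return stream_size // block_size


capacity = resident_block_capacity()             # 384
per_stream = blocks_per_launch(5, 2048, 4, 32)   # 80
all_streams = per_stream * 4                     # 320
print(capacity, per_stream, all_streams)
```

The capacity is an upper bound only; register or shared memory usage per block can lower the number of blocks that actually fit.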
Parallel kernel launches mainly make sense when you have, e.g., a bunch of small kernels that do different stuff which could run next to each other on separate multiprocessors. It seems that your workload could easily fill the entire device. What exactly are you hoping to achieve by running multiple kernels in parallel? Why are you working with such tiny blocks? Would it not make more sense to launch the whole thing as one big kernel with larger blocks? Normally, you'd want to have at least a couple warps per block. See, e.g., this question for more: How do I choose grid and block dimensions for CUDA kernels? If you're using shared memory, you'll also want at least two blocks per multiprocessor as you otherwise won't even be able to use all of it on some GPUs (which, e.g., offer 96 KiB shared memory per multiprocessor but each block can only have max 48 KiB of that)…
To add to the existing answer (which is completely correct), consider the following trivially complete version of the code you have posted in your question:
__global__
void Mykernel(float* data, int size)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
for(; tid < size; tid+= blockDim.x * gridDim.x) data[tid] = 54321.f;
}
int main()
{
int rows = 2048;
int cols = 2048;
int blockSize = 32;
dim3 block(blockSize);
int nStreams = 4; // preparation for streams
cudaStream_t *streams = (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));
for(int ii=0;ii<nStreams;ii++){
cudaStreamCreate(&streams[ii]);
}
float* d_Data;
cudaMalloc(&d_Data, sizeof(float) * rows * cols);
int streamSize = rows * cols / nStreams;
dim3 streamGrid = dim3(4);
for(int jj=0;jj<nStreams;jj++){
int offset = jj * streamSize;
Mykernel<<<streamGrid,block,0,streams[jj]>>>(&d_Data[offset],streamSize);
} // d_Data is the matrix on gpu
cudaDeviceSynchronize();
cudaDeviceReset();
}
Note two differences -- the number of blocks launched per kernel is reduced, and the amount of total computation per thread is increased by setting rows to 2048. The kernel itself contains a grid-stride loop which allows each thread to process multiple inputs, ensuring that the whole input dataset is processed no matter how many total blocks/threads are launched.
Profiling on a similar Maxwell GPU to your device shows this:
i.e. the kernels do overlap. Now let's reduce the problem size back to the size specified in your question (rows = 5):
The kernels no longer overlap. Why? Because driver and device latency is high enough, and the execution time of each kernel short enough that there is no time for execution overlap to occur, even when device resources would otherwise allow it. So beyond the resource requirement limitations described in the other answer, the volume of computation must be large enough to offset the fixed latency associated with scheduling a kernel launch within a stream.
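A toy timeline model makes the latency argument concrete. It assumes, as a simplification, that consecutive kernels can start no closer together than a fixed launch gap; the numbers plugged in are the ones reported in the question (a 12.378 µs gap and roughly 5 µs per kernel). The function names are hypothetical.

```python
def kernels_overlap(launch_gap_us, kernel_time_us):
    """True if a kernel is still running when the next one can start."""
    return kernel_time_us > launch_gap_us


def busy_intervals(n_kernels, launch_gap_us, kernel_time_us):
    """(start, end) times, assuming one kernel starts every launch_gap_us."""
    return [(i * launch_gap_us, i * launch_gap_us + kernel_time_us)
            for i in range(n_kernels)]


# Figures from the question: each kernel finishes long before the next starts.
print(kernels_overlap(12.378, 5.0))
# A kernel that runs much longer than the gap would overlap with its successor.
print(kernels_overlap(12.378, 50.0))
```

In this model the only way to get overlap is to make each kernel run longer than the launch gap, which is what increasing rows to 2048 does.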
Finally I would suggest that the correct approach to setting up a stream based concurrent execution scheme should look something like this:
int blockSize = 32;
dim3 block(blockSize);
int blocksperSM, SMperGPU = 13; // GPU specific
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksperSM, Mykernel, blockSize, 0); // kernel specific
dim3 streamGrid = blocksperSM * (SMperGPU / nStreams); // assume SMperGPU >> nstreams
Here, the idea is that the available SMs are (roughly) equally divided amongst the streams, and the number of blocks which maximally occupy each SM for the selected block size is obtained for the kernel via the occupancy API.
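The integer arithmetic behind streamGrid can be sketched in Python. The value of blocks_per_sm below is a placeholder assumption, since the real number comes from cudaOccupancyMaxActiveBlocksPerMultiprocessor at run time; the SM count of 13 and the 4 streams are taken from the snippets above.

```python
def stream_grid_size(blocks_per_sm, sm_per_gpu, n_streams):
    """Blocks per stream: full occupancy on an equal share of the SMs."""
    sms_per_stream = sm_per_gpu // n_streams   # integer division, as in the C code
    return blocks_per_sm * sms_per_stream


def unassigned_sms(sm_per_gpu, n_streams):
    """SMs left over when the SM count is not a multiple of the stream count."""
    return sm_per_gpu % n_streams


# blocks_per_sm = 32 is an assumed value for illustration only.
print(stream_grid_size(32, 13, 4))   # 3 SMs per stream
print(unassigned_sms(13, 4))
```

Because of the integer division, the scheme only balances well when the SM count is much larger than the number of streams, as the comment in the original snippet notes.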
This profiles as follows:
which yields both overlap, and short execution times by correctly matching the resource requirements of the kernel to the capacity of the GPU for the case with rows = 2048.
How to determine block size and grid size automatically for a 2D array (e.g. image processing) in CUDA?
CUDA has the cudaOccupancyMaxPotentialBlockSize() function to calculate the block size for CUDA kernel functions automatically (see here). It works well for a 1D array.
For my case, I have a 640x480 image.
How to determine the block/grid size?
I use:
////image size: 640x480
int x_min_grid_size, x_grid_size, x_block_size;
int y_min_grid_size, y_grid_size, y_block_size;
cudaOccupancyMaxPotentialBlockSize
(
&x_min_grid_size, &x_block_size,
my_cuda_kernel,
0, image.width()
);
cudaOccupancyMaxPotentialBlockSize
(
&y_min_grid_size, &y_block_size,
my_cuda_kernel,
0, image.height()
);
x_grid_size = (image.width() + x_block_size - 1) / x_block_size;
y_grid_size = (image.height() + y_block_size - 1) / y_block_size;
dim3 grid_dim(x_grid_size, y_grid_size);
dim3 block_dim(x_block_size, y_block_size);
my_cuda_kernel<<<grid_dim, block_dim>>>(<arguments...>)
////check cuda kernel function launch error
cudaError_t error = cudaGetLastError();
if(cudaSuccess != error)
{
std::cout<<"CUDA Error! "<<cudaGetErrorString(error)<<std::endl;
exit(1);
}
cudaDeviceSynchronize();
Question 1
Can I calculate block/grid size using this method?
With this code, I get an error after the kernel launch:
CUDA Error! invalid configuration arguments
If I set x_block_size = 32 and y_block_size = 32 manually, it works without error.
Why does CUDA report an invalid configuration argument error? It seems that I cannot use cudaOccupancyMaxPotentialBlockSize() directly for a 2D array.
Potential Solution
I have an idea for a potential solution:
What if I calculate the total thread count first, and then use cudaOccupancyMaxPotentialBlockSize() to calculate the block size for the 2D array:
////total_thread_num = 640x480 = 307200
int total_thread_num = image.width() * image.height();
////compute block/grid size
int min_grid_size, grid_size, block_size;
cudaOccupancyMaxPotentialBlockSize
(
&min_grid_size, &block_size,
my_cuda_kernel,
0, total_thread_num
);
grid_size = (total_thread_num + block_size - 1) / block_size;
//launch CUDA kernel function
my_cuda_kernel<<<grid_size, block_size>>>(<arguments...>);
my_cuda_kernel then computes the corresponding 2D index from the image width:
__global__ void my_cuda_kernel(unsigned int image_width) // image width passed as an argument
{
//compute 2D index based on 1D index;
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int row_idx = idx / image_width;
unsigned int col_idx = idx % image_width;
/*kernel function code*/
}
Question 2
If the method in Question 1 is not feasible, can I use the method above?
Question 1 Can I calculate block/grid size using this method?
No.
It is important to remember that these API calls provide the occupancy-maximizing number of threads per block, not the block dimensions. If you run the API once for each direction, you will likely get an illegal block size when the two values are combined. For example, if the occupancy-maximizing thread count for a kernel is 256, you could wind up with a 256 x 256 block, which is far larger than the limit of 1024 total threads per block, hence the launch failure.
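The failure mode described here comes down to one multiplication. The following Python sketch checks only the total-threads limit of 1024 mentioned above; it ignores any per-dimension limits, and the function name is invented for illustration.

```python
MAX_THREADS_PER_BLOCK = 1024  # total-threads limit quoted in the answer


def is_valid_block(x, y, z=1, limit=MAX_THREADS_PER_BLOCK):
    """True if an x * y * z block stays within the total thread limit."""
    return x * y * z <= limit


# Using the API result (e.g. 256) as both block dimensions is illegal...
print(is_valid_block(256, 256))
# ...whereas the manually chosen 32 x 32 block is exactly at the limit.
print(is_valid_block(32, 32))
```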
Question 2 If the method in Question 1 is not feasible, can I use the method above?
In principle, that should work, although you pay a small performance penalty because the integer modulo operation isn't particularly fast on the GPU. Alternatively, you could derive a 2D block size which satisfies your needs from the maximum threads per block returned by the API.
For example, if you just want blocks with 32 threads in the block dimension that maps to the major order of your data (for memory coalescing), then just divide the thread count by 32 (noting that the API will always return a round multiple of 32 threads per block because that is the warp size). So, if the threads per block returned from the API were 384, your block size would be 32 x 12.
If you really want some sort of tiling scheme which uses square blocks, then it is pretty easy to work out that only 64 (8 x 8), 256 (16 x 16), 576 (24 x 24) and 1024 (32 x 32) are the feasible block sizes which are both square numbers and round multiples of 32. In that case you probably want to select the largest of these block sizes which is less than or equal to the total thread count returned by the API.
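Both schemes from the last two paragraphs can be sketched in Python on the host side. The helper names are hypothetical, and threads_per_block stands for whatever the occupancy API returned.

```python
WARP_SIZE = 32


def row_major_block(threads_per_block, width=WARP_SIZE):
    """(x, y) block with `width` threads along the contiguous data dimension."""
    if threads_per_block % width != 0:
        raise ValueError("thread count must be a multiple of the block width")
    return width, threads_per_block // width


def largest_square_block(threads_per_block):
    """Largest n x n block whose size is a multiple of 32 and fits the budget."""
    best = None
    n = 1
    while n * n <= threads_per_block:
        if (n * n) % WARP_SIZE == 0:
            best = (n, n)
        n += 1
    return best


print(row_major_block(384))        # (32, 12), the example from the answer
print(largest_square_block(384))   # (16, 16)
```

If the budget is below 64 threads, largest_square_block returns None, since no square multiple of 32 fits.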
Ultimately, how you choose to do this will depend on the requirements of your kernel code. But it certainly is possible to devise a scheme for 2D block dimensioning which is compatible with the block sizing APIs which CUDA currently exposes.
I am trying to GPU accelerate an algorithm where I receive an asynchronous stream of particles in 3D space $p=[x,y,t]$. Each vector $p_n$ needs to be multiplied by a bunch of transformation matrices. Since these transformations are independent of each other they can happen in parallel, so I have written a CUDA kernel to do that. It works well, but of course for each incoming $p_n$ I end up launching the CUDA kernel anew. Launching a CUDA kernel carries a major time penalty, and thus I lose the advantage of GPU acceleration. So my question is, can I keep the kernel open and stream the particles to it somehow?
In case it's any help here is my current kernel:
__global__
void project(float *projection_matrix, float *vector, float *output_matrix) {
int col_index = blockIdx.x * blockDim.x + threadIdx.x;
int row_index = blockIdx.y * blockDim.x + threadIdx.y;
int output_index = (col_index*3 + threadIdx.y);
int transform_first_element = col_index * 9 + threadIdx.y * 3;
int stride = blockDim.x*blockDim.y*gridDim.x;
while (output_index < (NUMBER_OF_TRANSFORMS * 3)) {
output_matrix[output_index] = projection_matrix[transform_first_element]*vector[0]+ projection_matrix[(transform_first_element+1)]*vector[1] + projection_matrix[(transform_first_element+2)]*vector[2];
output_index += stride;
}
}
and this is where I call it:
...
project <<<num_blocks_dim, block_dim >>> (transformationList, inputVector, outputMatrix);
cudaDeviceSynchronize();
...
You'll need to batch the requests up into a larger block and invoke a kernel on many particles. You can likely use the third dimension of the kernel to iterate over them. One way to do this is to accumulate incoming particles while the kernel is running. If you do not get enough particles to justify the kernel launch, process them on the CPU.
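The batching idea can be sketched on the host side as follows. This is Python pseudo-structure rather than CUDA: ParticleBatcher, gpu_fn and cpu_fn are hypothetical names, and the two callables stand in for a batched kernel launch and a CPU fallback respectively.

```python
class ParticleBatcher:
    """Collects incoming particles and processes them in batches."""

    def __init__(self, min_gpu_batch, gpu_fn, cpu_fn):
        self.min_gpu_batch = min_gpu_batch  # smallest batch worth a kernel launch
        self.gpu_fn = gpu_fn                # stand-in for the batched kernel launch
        self.cpu_fn = cpu_fn                # stand-in for the CPU fallback
        self.pending = []

    def add(self, particle):
        """Queue a particle, e.g. while the previous batch is still running."""
        self.pending.append(particle)

    def flush(self):
        """Process everything queued so far; returns (path, results)."""
        batch, self.pending = self.pending, []
        if not batch:
            return "none", []
        if len(batch) >= self.min_gpu_batch:
            return "gpu", self.gpu_fn(batch)
        return "cpu", self.cpu_fn(batch)


def fake_transform(batch):
    """Placeholder for the real projection; just copies the particles."""
    return [tuple(p) for p in batch]
```

The threshold min_gpu_batch would have to be found by measurement, since it depends on the launch overhead and on the number of transformation matrices.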
If the particles are being produced on the GPU, you have the option to launch a kernel from a kernel with newer versions of CUDA, but you still need a pretty large block to make that win.
If these are coming from the CPU and then going back to the CPU, I'd be surprised if you can make it pay off at all unless the number of matrices is pretty large. (Comparing to well optimized SIMD CPU code.)
I am trying to implement a sum reduction with PyOpenCL, similar to the example at https://dournac.org/info/gpu_sum_reduction . I'm trying to sum a vector whose values are all 1. The result should be 16384 in the first element.
However, it seems that only some of the points are being gathered. Is a local index necessary? Is there a race condition (when I run it twice, the result is not the same)? What's wrong with the following code?
import numpy as np
import pyopencl as cl
def readKernel(kernelFile):
with open(kernelFile, 'r') as f:
data=f.read()
return data
a_np = np.random.rand(128*128).astype(np.float32)
a_np=a_np.reshape((128,128))
print(a_np.shape)
device = cl.get_platforms()[0].get_devices(cl.device_type.GPU)[0]
print(device)
ctx=cl.Context(devices=[device])
#ctx = cl.create_some_context() #ask which context to use
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a_np)
prg = cl.Program(ctx,readKernel("kernel2.cl")).build()
prg.test(queue, a_np.shape, None, a_g)
cl.enqueue_copy(queue, a_np, a_g).wait()
np.savetxt("teste2.txt",a_np,fmt="%i")
The kernel is:
__kernel void test(__global float *count){
int id = get_global_id(0)+get_global_id(1)*get_global_size(0);
int nelements = get_global_size(0)*get_global_size(1);
count[id] = 1;
barrier(CLK_GLOBAL_MEM_FENCE);
for (int stride = nelements/2; stride>0; stride = stride/2){
barrier(CLK_GLOBAL_MEM_FENCE); //wait everyone update
if (id < stride){
int s1 = count[id];
int s2 = count[id+stride];
count[id] = s1+s2;
}
}
barrier(CLK_GLOBAL_MEM_FENCE); //wait everyone update
}
The problem is that your kernel is written to do the reduction within a single workgroup, but many workgroups are implicitly scheduled.
The maximum number of work items per workgroup depends on the GPU. For Nvidia it is 1024; for AMD and Intel it is 256 (older Intel GPUs had 512).
Let's assume for this example that the maximum number of work items per workgroup on your GPU is 256. In that case the largest 2D workgroup is 16x16, so if your matrix had that size, your kernel would return the correct result. With the original size of 128x128 and no local size specified when enqueueing the kernel, the implementation picks one for you: you get a global size of 128x128 and (very likely) a local size of 16x16, which means 8 x 8 = 64 workgroups are scheduled.
In the current kernel each workgroup starts its calculation from different ids, but the indices are reduced all the way down to 0, and barrier() only synchronizes work items within one workgroup. So you have a race condition, hence the different results on each run.
You have 2 options to fix this:
Rewrite your kernel to calculate everything within one workgroup and schedule it with global and local sizes of (16,16) and (16,16), or whatever the maximum number of work items per workgroup on your device allows.
Use global and local sizes of (128,128) and (16,16), so that each workgroup calculates its own partial result; these partial results then have to be summed on the CPU side to get the final result.
For 128x128 the first option will be preferred as it should perform faster and should be more straightforward to implement.
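The second option can be simulated in plain Python to show why it gives a deterministic result. This sketch is not OpenCL: it assumes a power-of-two group size that divides the input length, and it splits the data into 1D chunks rather than 16x16 tiles, which does not change the sum.

```python
def workgroup_partial_sums(values, group_size):
    """Tree-reduce each workgroup-sized chunk independently."""
    partials = []
    for start in range(0, len(values), group_size):
        local = list(values[start:start + group_size])  # this group's own data
        stride = len(local) // 2
        while stride > 0:
            # Every "work item" i < stride adds its partner's value.
            for i in range(stride):
                local[i] += local[i + stride]
            stride //= 2
        partials.append(local[0])
    return partials


data = [1] * (128 * 128)
partials = workgroup_partial_sums(data, 256)  # 64 groups of 256 work items
total = sum(partials)                         # final pass on the "CPU side"
print(len(partials), total)
```

Each group only touches its own chunk, so no synchronization across workgroups is needed.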
I'm using Intel IPP for multiplication of 2 Images (Arrays).
I'm using Intel IPP 8.2 which comes with Intel Composer 2015 Update 6.
I created a simple function to multiply two large images (the whole project is attached, see below).
I wanted to see the gains from using the Intel IPP multi-threaded library.
Here is the simple project (I also attached the complete project from Visual Studio):
#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"
#include <ctime>
#include <iostream>
using namespace std;
const int height = 6000;
const int width = 6000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};
int main()
{
IppiSize size = {width, height};
double start = clock();
for (int i = 0; i < 200; i++)
ippiMul_32f_C1R(mInput_image, 6000 * 4, mInput_image, 6000 * 4, mOutput_image, 6000 * 4, size);
double end = clock();
double douration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);
cout << douration << endl;
cin.get();
return 0;
}
I compiled this project once using Intel IPP Single Threaded and once using Intel IPP Multi Threaded.
I tried different array sizes, and in all of them the multi-threaded version yields no gain (sometimes it is even slower).
How come there is no gain from multi-threading in this task?
I know Intel IPP uses AVX, so I thought maybe the task becomes memory bound?
I tried another approach: using OpenMP manually to get multi-threading on top of the single-threaded Intel IPP implementation.
This is the code:
#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"
#include <ctime>
#include <iostream>
using namespace std;
#include <omp.h>
const int height = 5000;
const int width = 5000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};
int main()
{
IppiSize size = {width, height};
double start = clock();
IppiSize blockSize = {width, height / 4};
const int NUM_BLOCK = 4;
omp_set_num_threads(NUM_BLOCK);
Ipp32f* in;
Ipp32f* out;
// ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4, mOutput_image, width * 4, size);
#pragma omp parallel \
shared(mInput_image, mOutput_image, blockSize) \
private(in, out)
{
int id = omp_get_thread_num();
int step = blockSize.width * blockSize.height * id;
in = mInput_image + step;
out = mOutput_image + step;
ippiMul_32f_C1R(in, width * 4, in, width * 4, out, width * 4, blockSize);
}
double end = clock();
double douration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);
cout << douration << endl;
cin.get();
return 0;
}
The results were the same: again, no performance gain.
Is there a way to benefit from multi-threading in this kind of task?
How can I verify whether a task is memory bound and hence gains nothing from parallelization?
Is there any benefit in parallelizing the multiplication of 2 arrays on a CPU with AVX?
The computer I tried it on is based on a Core i7 4770K (Haswell).
Here is a link to the Project in Visual Studio 2013.
Thank You.
Your images occupy 200 MB in total (2 x 5000 x 5000 x 4 bytes). Each block therefore consists of 50 MB of data. This is more than 6 times the size of your CPU's L3 cache (see here). Each AVX vector multiplication operates on 256 bits of data, which is half a cache line, i.e. it consumes one cache line per vector instruction (half a cache line for each argument). A vectorised multiplication on Haswell has a latency of 5 cycles and the FPU can retire two such instructions per cycle (see here). The memory bus of the i7-4770K is rated at 25.6 GB/s (theoretical maximum!), or no more than 430 million cache lines per second. The nominal speed of the CPU is 3.5 GHz. The AVX part is clocked a bit lower, let's say at 3.1 GHz. At that speed, it takes an order of magnitude more cache lines per second to fully feed the AVX engine.
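The estimate above can be reproduced with a short Python calculation. It is a back-of-the-envelope sketch using only the figures from this answer; reading 25.6 GB/s as binary gigabytes is an assumption made here because it reproduces the quoted 430 million cache lines per second (with decimal gigabytes the figure would be 400 million).

```python
CACHE_LINE_BYTES = 64

# Data volume: two 5000 x 5000 float images, split into 4 blocks.
image_bytes = 5000 * 5000 * 4
total_bytes = 2 * image_bytes          # 200 MB
block_bytes = total_bytes // 4         # 50 MB

# Supply: cache lines per second the memory bus can deliver at best.
supply = 25.6 * 2**30 / CACHE_LINE_BYTES

# Demand: two vector multiplies per cycle at 3.1 GHz, one cache line each.
demand = 2 * 3.1e9

ratio = demand / supply
print(total_bytes, block_bytes)
print(round(supply / 1e6), "million cache lines/s available")
print(round(ratio, 1), "times more demand than supply")
```

A ratio well above 1 means a single thread already waits on memory, which is the sense in which the task is memory bound.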
Under those conditions, a single thread of vectorised code almost fully saturates the memory bus of your CPU. Adding a second thread might result in a very slight improvement. Adding further threads only results in contention and added overhead. The only way to speed up such a calculation is to increase the memory bandwidth:
run on a NUMA system with more memory controllers and therefore higher aggregate memory bandwidth, e.g. a multisocket server board;
switch to a different architecture with much higher memory bandwidth, e.g. Intel Xeon Phi or a GPGPU.
From some research of my own, it looks like your total CPU cache is around 8 MB. 6000*4/4 (6000 floats split into blocks of 4) is 6MB. Multiply this by 2 (in and out), and you're outside of the cache.
I haven't tested this, but increasing the number of blocks should increase the performance. Try 8 to start out with (your CPU supports hyperthreading, giving 8 virtual cores).
Currently, the different threads spawned by OpenMP are having cache conflicts and have to (re)load from main memory. Reducing the size of the blocks can help with this. Having distinct caches would effectively increase the size of your cache, but it seems that's not an option.
If you're just doing this as a proof of principle, you may want to test this by running it on your graphics card. Although, that can be even harder to implement properly.
If you run with hyperthreading enabled, you should try the OpenMP version of IPP with 1 thread per core and set OMP_PLACES=cores if IPP doesn't do it automatically. If you use Cilk with IPP, try varying CILK_NWORKERS.
You might try a test case large enough to span multiple 4 KB pages. Then additional factors come into play. Ideally, IPP will put the threads to work on different pages. On Linux (or Mac?) transparent huge pages should kick in. On Windows, Haswell CPUs introduced hardware page prefetch, which should reduce but not eliminate the importance of THP.
I posted this on MATLAB Central but didn't get any responses, so I figured I'd repost here.
I recently wrote a simple routine in MATLAB that uses an FFT in a for-loop; the FFT dominates the calculations. I wrote the same routine in mex just for experimentation purposes, and it calls the FFTW 3.3 library. It turns out that the MATLAB routine runs faster than the mex routine for very large arrays (about twice as fast). The mex routine uses wisdom and performs the same FFT calculations. I also know MATLAB uses FFTW, but is it possible their version is slightly more optimized? I even used the FFTW_EXHAUSTIVE flag and it's still about twice as slow for large arrays as the MATLAB counterpart. Furthermore, I ensured the MATLAB I used was single threaded with the "-singleCompThread" flag and that the mex file I used was not in debug mode. Just curious if this is the case, or if there are some optimizations MATLAB is using under the hood that I don't know about. Thanks.
Here's the mex portion:
void class_cg_toeplitz::analysis() {
// This method computes CG iterations using FFTs
// Check for wisdom
if(fftw_import_wisdom_from_filename("cd.wis") == 0) {
mexPrintf("wisdom not loaded.\n");
} else {
mexPrintf("wisdom loaded.\n");
}
// Set FFTW Plan - use interleaved FFTW
fftw_plan plan_forward_d_buffer;
fftw_plan plan_forward_A_vec;
fftw_plan plan_backward_Ad_buffer;
fftw_complex *A_vec_fft;
fftw_complex *d_buffer_fft;
A_vec_fft = fftw_alloc_complex(n);
d_buffer_fft = fftw_alloc_complex(n);
// CREATE MASTER PLAN - Do this on an empty vector as creating a plan
// with FFTW_MEASURE will erase the contents;
// Use d_buffer
// This is somewhat dangerous because Ad_buffer is a vector; but it does not
// get resized so &Ad_buffer[0] should work
plan_forward_d_buffer = fftw_plan_dft_r2c_1d(d_buffer.size(),&d_buffer[0],d_buffer_fft,FFTW_EXHAUSTIVE);
plan_forward_A_vec = fftw_plan_dft_r2c_1d(A_vec.height,A_vec.value,A_vec_fft,FFTW_WISDOM_ONLY);
// A_vec_fft.*d_buffer_fft will overwrite d_buffer_fft
plan_backward_Ad_buffer = fftw_plan_dft_c2r_1d(Ad_buffer.size(),d_buffer_fft,&Ad_buffer[0],FFTW_EXHAUSTIVE);
// Get A_vec_fft
fftw_execute(plan_forward_A_vec);
// Find initial direction - this is the initial residual
for (int i=0;i<n;i++) {
d_buffer[i] = b.value[i];
r_buffer[i] = b.value[i];
}
// Start CG iterations
norm_ro = norm(r_buffer);
double fft_reduction = (double)Ad_buffer.size(); // Must divide by size of vector because inverse FFT does not do this
while (norm(r_buffer)/norm_ro > relativeresidual_cutoff) {
// Find Ad - use fft
fftw_execute(plan_forward_d_buffer);
// Get A_vec_fft.*fft(d) - A_vec_fft is only real, but d_buffer_fft
// has complex elements; Overwrite d_buffer_fft
for (int i=0;i<n;i++) {
d_buffer_fft[i][0] = d_buffer_fft[i][0]*A_vec_fft[i][0]/fft_reduction;
d_buffer_fft[i][1] = d_buffer_fft[i][1]*A_vec_fft[i][0]/fft_reduction;
}
fftw_execute(plan_backward_Ad_buffer);
// Calculate r'*r
rtr_buffer = 0;
for (int i=0;i<n;i++) {
rtr_buffer = rtr_buffer + r_buffer[i]*r_buffer[i];
}
// Calculate alpha
alpha = 0;
for (int i=0;i<n;i++) {
alpha = alpha + d_buffer[i]*Ad_buffer[i];
}
alpha = rtr_buffer/alpha;
// Calculate new x
for (int i=0;i<n;i++) {
x[i] = x[i] + alpha*d_buffer[i];
}
// Calculate new residual
for (int i=0;i<n;i++) {
r_buffer[i] = r_buffer[i] - alpha*Ad_buffer[i];
}
// Calculate beta
beta = 0;
for (int i=0;i<n;i++) {
beta = beta + r_buffer[i]*r_buffer[i];
}
beta = beta/rtr_buffer;
// Calculate new direction vector
for (int i=0;i<n;i++) {
d_buffer[i] = r_buffer[i] + beta*d_buffer[i];
}
*total_counter = *total_counter+1;
if(*total_counter >= iteration_cutoff) {
// Set total_counter to -1, this indicates failure
*total_counter = -1;
break;
}
}
// Store Wisdom
fftw_export_wisdom_to_filename("cd.wis");
// Free fft alloc'd memory and plans
fftw_destroy_plan(plan_forward_d_buffer);
fftw_destroy_plan(plan_forward_A_vec);
fftw_destroy_plan(plan_backward_Ad_buffer);
fftw_free(A_vec_fft);
fftw_free(d_buffer_fft);
};
Here's the matlab portion:
% Take FFT of A_vec.
A_vec_fft = fft(A_vec); % Take fft once
% Find initial direction - this is the initial residual
x = zeros(n,1); % search direction
r = zeros(n,1); % residual
d = zeros(n+(n-2),1); % search direction; pad to allow FFT
for i = 1:n
d(i) = b(i);
r(i) = b(i);
end
% Enter CG iterations
total_counter = 0;
rtr_buffer = 0;
alpha = 0;
beta = 0;
Ad_buffer = zeros(n+(n-2),1); % This holds the product of A*d - calculate this once per iteration and using FFT; only 1:n is used
norm_ro = norm(r);
while(norm(r)/norm_ro > 10^-6)
% Find Ad - use fft
Ad_buffer = ifft(A_vec_fft.*fft(d));
% Calculate rtr_buffer
rtr_buffer = r'*r;
% Calculate alpha
alpha = rtr_buffer/(d(1:n)'*Ad_buffer(1:n));
% Calculate new x
x = x + alpha*d(1:n);
% Calculate new residual
r = r - alpha*Ad_buffer(1:n);
% Calculate beta
beta = r'*r/(rtr_buffer);
% Calculate new direction vector
d(1:n) = r + beta*d(1:n);
% Update counter
total_counter = total_counter+1;
end
In terms of time, for N = 50000 and b = 1:n it takes about 10.5 seconds with mex and 4.4 seconds with matlab. I'm using R2011b. Thanks
A few observations rather than a definite answer since I do not know any of the specifics of the MATLAB FFT implementation:
Based on the code you have, I can see two explanations for the speed difference:
the speed difference is explained by differences in levels of optimization of the FFT
the while loop in MATLAB is executed a significantly smaller number of times
I will assume you already looked into the second issue and that the numbers of iterations are comparable. (If they aren't, this is most likely due to some accuracy issue and worth further investigation.)
Now, regarding FFT speed comparison:
Yes, the theory is that FFTW is faster than other high-level FFT implementations, but that is only relevant as long as you compare apples to apples: here you are comparing implementations a level further down, at the assembly level, where not only the selection of the algorithm but also its actual optimization for a specific processor, by software developers with varying skills, comes into play.
I have optimized or reviewed optimized FFTs in assembly on many processors over the years (I was in the benchmarking industry), and great algorithms are only part of the story. There are considerations that are very specific to the architecture you are coding for (accounting for latencies, scheduling of instructions, optimization of register usage, arrangement of data in memory, accounting for branch taken/not taken latencies, etc.) and that make differences as important as the selection of the algorithm.
With N=50000, we are also talking about large memory buffers: yet another door for more optimizations that can quickly get pretty specific to the platform you run your code on. How well you manage to avoid cache misses won't be dictated by the algorithm so much as by how the data flows and what optimizations a software developer may have used to bring data in and out of memory efficiently.
Though I do not know the details of the MATLAB FFT implementation, I am pretty sure that an army of DSP engineers has been (and is still) honing on its optimization as it is key to so many designs. This could very well mean that MATLAB had the right combination of developers to produce a much faster FFT.
This is classic performance gain thanks to low-level and architecture-specific optimization.
Matlab uses the FFT from the Intel MKL (Math Kernel Library) binary (mkl.dll). These are routines optimized (at assembly level) by Intel for Intel processors. Even on AMD processors it seems to give nice performance boosts.
FFTW seems like a normal C library that is not as optimized. Hence the performance gain from using the MKL.
I have found the following comment on the MathWorks website [1]:
Note on large powers of 2: For FFT dimensions that are powers of
2, between 2^14 and 2^22, MATLAB software uses special preloaded
information in its internal database to optimize the FFT computation.
No tuning is performed when the dimension of the FFT is a power of 2,
unless you clear the database using the command fftw('wisdom', []).
Although it relates to powers of 2, it may hint that MATLAB employs its own 'special wisdom' when using FFTW for certain (large) array sizes. Consider: 2^16 = 65536.
[1] R2013b Documentation available from http://www.mathworks.de/de/help/matlab/ref/fftw.html (accessed on 29 Oct 2013)
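Whether a given transform length falls into the range described by the quoted note can be checked with a small Python helper. The function names are made up for illustration, and the lengths tested are N = 50000 from the question, the padded length n + (n - 2) used in the MATLAB code, and 2^16 from this answer.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (bit trick: a power of two has one set bit)."""
    return n > 0 and (n & (n - 1)) == 0


def in_preloaded_range(n):
    """True if n is a power of 2 between 2^14 and 2^22, as in the quoted note."""
    return is_power_of_two(n) and 2**14 <= n <= 2**22


n = 50000
for length in (n, n + (n - 2), 2**16):
    print(length, in_preloaded_range(length))
```

Neither length from the question is a power of 2, so the quoted note would only apply if the data were padded to such a length.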
EDIT: @wakjah's reply to this answer is accurate: FFTW does support split real and imaginary memory storage via its Guru interface. My claim about hacking is thus not accurate, but it can very well apply if FFTW's Guru interface is not used, which is the case by default, so beware still!
First, sorry for being a year late. I'm not convinced that the speed increase you see comes from MKL or other optimizations. There is something quite fundamentally different between FFTW and Matlab, and that is how complex data is stored in memory.
In Matlab, the real and imaginary parts of a complex vector X are separate arrays Xre[i] and Xim[i] (linear in memory, efficient when operating on either of them separately).
In FFTW, the real and imaginary parts are interlaced as double[2] by default, i.e. X[i][0] is the real part, and X[i][1] is the imaginary part.
Thus, to use the FFTW library in mex files one cannot use the Matlab array directly, but must allocate new memory first, then pack the input from Matlab into FFTW format, and then unpack the output from FFTW into Matlab format. i.e.
X = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
Y = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
then
for (size_t i=0; i<N; ++i) {
X[i][0] = Xre[i];
X[i][1] = Xim[i];
}
then
for (size_t i=0; i<N; ++i) {
Yre[i] = Y[i][0];
Yim[i] = Y[i][1];
}
Hence, this requires 2x memory allocations + 4x reads + 4x writes -- all of size N. This does take a toll speed-wise on large problems.
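The pack and unpack steps amount to converting between a split layout (separate real and imaginary arrays) and an interleaved one. The Python sketch below illustrates only the data movement with plain lists; the function names are invented, and it says nothing about how either library implements it.

```python
def pack_interleaved(re, im):
    """Split layout (re[], im[]) -> interleaved [re0, im0, re1, im1, ...]."""
    if len(re) != len(im):
        raise ValueError("real and imaginary parts must have equal length")
    out = [0.0] * (2 * len(re))
    out[0::2] = re   # even slots hold the real parts
    out[1::2] = im   # odd slots hold the imaginary parts
    return out


def unpack_interleaved(buf):
    """Interleaved layout -> split layout; each output is a stride-2 read."""
    return list(buf[0::2]), list(buf[1::2])


packed = pack_interleaved([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(packed)
print(unpack_interleaved(packed))
```

Every element is read once and written once in each direction, which is where the extra memory traffic for large N comes from.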
I have a hunch that Mathworks may have hacked the FFTW3 code to allow it to read input vectors directly in the Matlab format, which avoids all of the above.
In this scenario, one can allocate only X and reuse X for Y to run FFTW in-place (as fftw_plan_*(N, X, X, ...) instead of fftw_plan_*(N, X, Y, ...)), since the result will be copied to the Yre and Yim Matlab vectors anyway, unless the application requires/benefits from keeping X and Y separate.
EDIT: Looking at the memory consumption in real time when running Matlab's fft2() and my code based on the fftw3 library shows that Matlab allocates only one additional complex array (the output), whereas my code needs two such arrays (the fftw_complex* buffer plus the Matlab output). An in-place conversion between the Matlab and FFTW formats is not possible because Matlab's real and imaginary arrays are not consecutive in memory. This suggests that Mathworks hacked the fftw3 library to read/write the data using the Matlab format.
One other optimization for multiple calls, is to allocate persistently (using mexMakeMemoryPersistent()). I'm not sure if the Matlab implementation does this as well.
Cheers.
p.s. As a side note, the Matlab complex data storage format is more efficient for operating on the real or imaginary vectors separately. With FFTW's format you'd have to do stride-2 memory reads.