Reducing GPU-CPU data transfers in C++Amp - c++

I have encountered a following problem when trying to optimize my application with C++Amp: the data transfers. For me, there is no problem with copying data from CPU to GPU (as I can do it in the initial state of the application). The worse thing is that I need a fast access to the results computed by C++Amp kernels so the bottleneck between GPU and CPU is a pain. I read that there is a performance boost under Windows 8.1, however I am using Windows 7 and I am not planing to change it. I read about staging arrays but I don't know how they could help solve my problem. I need to return a single float value to the host and it seems that it is the most time consuming operation.
float Subset::reduction_cascade(unsigned element_count, concurrency::array<float, 1>& a)
{
static_assert(_tile_count > 0, "Tile count must be positive!");
//static_assert(IS_POWER_OF_2(_tile_size), "Tile size must be a positive integer power of two!");
assert(source.size() <= UINT_MAX);
//unsigned element_count = static_cast<unsigned>(source.size());
assert(element_count != 0); // Cannot reduce an empty sequence.
unsigned stride = _tile_size * _tile_count * 2;
// Reduce tail elements.
float tail_sum = 0.f;
unsigned tail_length = element_count % stride;
// Using arrays as a temporary memory.
//concurrency::array<float, 1> a(element_count, source.begin());
concurrency::array<float, 1> a_partial_result(_tile_count);
concurrency::parallel_for_each(concurrency::extent<1>(_tile_count * _tile_size).tile<_tile_size>(), [=, &a, &a_partial_result] (concurrency::tiled_index<_tile_size> tidx) restrict(amp)
{
// Use tile_static as a scratchpad memory.
tile_static float tile_data[_tile_size];
unsigned local_idx = tidx.local[0];
// Reduce data strides of twice the tile size into tile_static memory.
unsigned input_idx = (tidx.tile[0] * 2 * _tile_size) + local_idx;
tile_data[local_idx] = 0;
do
{
tile_data[local_idx] += a[input_idx] + a[input_idx + _tile_size];
input_idx += stride;
} while (input_idx < element_count);
tidx.barrier.wait();
// Reduce to the tile result using multiple threads.
for (unsigned stride = _tile_size / 2; stride > 0; stride /= 2)
{
if (local_idx < stride)
{
tile_data[local_idx] += tile_data[local_idx + stride];
}
tidx.barrier.wait();
}
// Store the tile result in the global memory.
if (local_idx == 0)
{
a_partial_result[tidx.tile[0]] = tile_data[0];
}
});
// Reduce results from all tiles on the CPU.
std::vector<float> v_partial_result(_tile_count);
copy(a_partial_result, v_partial_result.begin());
return std::accumulate(v_partial_result.begin(), v_partial_result.end(), tail_sum);
}
I checked that in the example above the most time-consuming operation is copy(a_partial_result, v_partial_result.begin());. I am trying to find a better approach.

So I think there's something else going on here. Have you tried running the original sample on which your code is based? This is available on CodePlex.
Load the samples solution and build the Reduction project in Release mode and then run it without the debugger attached. You should see some output like this.
Running kernels with 16777216 elements, 65536 KB of data ...
Tile size: 512
Tile count: 128
Using device : NVIDIA GeForce GTX 570
Total : Calc
SUCCESS: Overhead 0.03 : 0.00 (ms)
SUCCESS: CPU sequential 9.48 : 9.45 (ms)
SUCCESS: CPU parallel 5.92 : 5.89 (ms)
SUCCESS: C++ AMP simple model 25.34 : 3.19 (ms)
SUCCESS: C++ AMP simple model using array_view 62.09 : 20.61 (ms)
SUCCESS: C++ AMP simple model optimized 25.24 : 1.81 (ms)
SUCCESS: C++ AMP tiled model 29.70 : 7.27 (ms)
SUCCESS: C++ AMP tiled model & shared memory 30.40 : 7.56 (ms)
SUCCESS: C++ AMP tiled model & minimized divergence 25.21 : 5.77 (ms)
SUCCESS: C++ AMP tiled model & no bank conflicts 25.52 : 3.92 (ms)
SUCCESS: C++ AMP tiled model & reduced stalled threads 21.25 : 2.03 (ms)
SUCCESS: C++ AMP tiled model & unrolling 22.94 : 1.55 (ms)
SUCCESS: C++ AMP cascading reduction 20.17 : 0.92 (ms)
SUCCESS: C++ AMP cascading reduction & unrolling 24.01 : 1.20 (ms)
Note that none of the examples are taking anywhere near the time you code is. Although it's fair to say that the CPU is faster and data copy time is a big contributing factor here.
This is to be expected. Effective use of a GPU involves moving more than operations like reduction to the GPU. You need to move significant amount of compute to make up for the copy overhead.
Some things you should consider:
What happens with you run the sample from CodePlex?
Are you running a release build with optimization enabled?
Are you sure running are running against the actual GPU hardware and not against a WARP (software emulator) accelerator?
Some more information that would be helpful
what hardware are you using?
How large is your data set, both the input data and the size of the partial result array?

Related

Whats the difference between local and global memory fence in OpenCL?

I was trying to made a reduce sum, with PyOpenCL, similar to the example: https://dournac.org/info/gpu_sum_reduction . I'm trying to sum a vector with all values 1. The result should be 16384 in the first element.
However it seems that just some points are being gathered. Is it necessary a local index? Is there any race condition (when I run it twice the result is not the same)? Whats wrong with the following code?
import numpy as np
import pyopencl as cl
def readKernel(kernelFile):
with open(kernelFile, 'r') as f:
data=f.read()
return data
a_np = np.random.rand(128*128).astype(np.float32)
a_np=a_np.reshape((128,128))
print(a_np.shape)
device = cl.get_platforms()[0].get_devices(cl.device_type.GPU)[0]
print(device)
ctx=cl.Context(devices=[device])
#ctx = cl.create_some_context() #ask which context to use
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a_np)
prg = cl.Program(ctx,readKernel("kernel2.cl")).build()
prg.test(queue, a_np.shape, None, a_g)
cl.enqueue_copy(queue, a_np, a_g).wait()
np.savetxt("teste2.txt",a_np,fmt="%i")
The kernel is:
__kernel void test(__global float *count){
int id = get_global_id(0)+get_global_id(1)*get_global_size(0);
int nelements = get_global_size(0)*get_global_size(1);
count[id] = 1;
barrier(CLK_GLOBAL_MEM_FENCE);
for (int stride = nelements/2; stride>0; stride = stride/2){
barrier(CLK_GLOBAL_MEM_FENCE); //wait everyone update
if (id < stride){
int s1 = count[id];
int s2 = count[id+stride];
count[id] = s1+s2;
}
}
barrier(CLK_GLOBAL_MEM_FENCE); //wait everyone update
}
The problem is that your kernel is implemented to do reduction within one workgroup and there is implicitely schedulled many workgroups.
Depending on the GPU there is different number of maximum work items per workgroup. For Nvidia that is 1024, AMD and Intel 256 (Intel in older GPUs had 512).
Lets assume for this example that maximum work items per workgroup on your GPU is 256. In this case the maximum 2d worgroup size can be 16x16, so if you use that size of your matrix your kernel will return correct result. Using the original size 128x128 and not specifying local size when scheduling the kernel the implementation calculates that for you and you are getting global size 128x128 and local size (very likely) 16x16 which means 8 worgroups are being scheduled.
In the current kernel each workgroup is starting calculation from different id but the indices are reduced until 0 so you have race condition hence different results each run.
You have 2 options to fix this:
Rewrite your kernel to calculate everything within one workgroup and schedule it with global,local size: (16x16),(16,16) or whatever your max work items per workgroup device has
Use global,local size: (128x128),(16x16) and each workgroup will calculate its result which then on the cpu side will have to be sum up for each workgroup to get the final result.
For 128x128 the first option will be preferred as it should perform faster and should be more straightforward to implement.

Multi Threading Performance in Multiplication of 2 Arrays / Images - Intel IPP

I'm using Intel IPP for multiplication of 2 Images (Arrays).
I'm using Intel IPP 8.2 which comes with Intel Composer 2015 Update 6.
I created a simple function to multiply too large images (The whole project is attached, see below).
I wanted to see the gains using Intel IPP Multi Threaded Library.
Here is the simple project (I also attached the complete project form Visual Studio):
#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"
#include <ctime>
#include <iostream>
using namespace std;
const int height = 6000;
const int width = 6000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};
int main()
{
IppiSize size = {width, height};
double start = clock();
for (int i = 0; i < 200; i++)
ippiMul_32f_C1R(mInput_image, 6000 * 4, mInput_image, 6000 * 4, mOutput_image, 6000 * 4, size);
double end = clock();
double douration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);
cout << douration << endl;
cin.get();
return 0;
}
I compiled this project once using Intel IPP Single Threaded and once using Intel IPP Multi Threaded.
I tried different sizes of arrays and in all of them the Multi Threaded version yields no gains (Sometimes it is even slower).
I wonder, how come there is no gain in this task with multi threading?
I know Intel IPP uses the AVX and I thought maybe the task becomes Memory Bounded?
I tried another approach by using OpenMP manually to have Multi Threaded approach using Intel IPP Single Thread implementation.
This is the code:
#include "ippi.h"
#include "ippcore.h"
#include "ipps.h"
#include "ippcv.h"
#include "ippcc.h"
#include "ippvm.h"
#include <ctime>
#include <iostream>
using namespace std;
#include <omp.h>
const int height = 5000;
const int width = 5000;
Ipp32f mInput_image [1 * width * height];
Ipp32f mOutput_image[1 * width * height] = {0};
int main()
{
IppiSize size = {width, height};
double start = clock();
IppiSize blockSize = {width, height / 4};
const int NUM_BLOCK = 4;
omp_set_num_threads(NUM_BLOCK);
Ipp32f* in;
Ipp32f* out;
// ippiMul_32f_C1R(mInput_image, width * 4, mInput_image, width * 4, mOutput_image, width * 4, size);
#pragma omp parallel \
shared(mInput_image, mOutput_image, blockSize) \
private(in, out)
{
int id = omp_get_thread_num();
int step = blockSize.width * blockSize.height * id;
in = mInput_image + step;
out = mOutput_image + step;
ippiMul_32f_C1R(in, width * 4, in, width * 4, out, width * 4, blockSize);
}
double end = clock();
double douration = (end - start) / static_cast<double>(CLOCKS_PER_SEC);
cout << douration << endl;
cin.get();
return 0;
}
The results were the same, again, no gain of performance.
Is there a way to benefit from Multi Threading in this kind of task?
How can I validate whether a task becomes memory bounded and hence no benefit in parallelize it?
Are there benefit to parallelize task of multiplying 2 arrays on CPU with AVX?
The Computers I tried it on is based on Core i7 4770k (Haswell).
Here is a link to the Project in Visual Studio 2013.
Thank You.
Your images occupy 200 MB in total (2 x 5000 x 5000 x 4 bytes). Each block therefore consists of 50 MB of data. This is more than 6 times than the size of your CPU's L3 cache (see here). Each AVX vector multiplication operates on 256 bits of data, which is half a cache line, i.e. it consumes one cache line per vector instruction (half a cache line for each argument). A vectorised multiplication on Haswell has a latency of 5 cycles and the FPU can retire two such instructions per cycle (see here). The memory bus of i7-4770K is rated at 25.6 GB/s (theoretical maximum!) or no more than 430 million cache lines per second . The nominal speed of the CPU is 3.5 GHz. The AVX part is clocked a bit lower, let's say at 3.1 GHz. At that speed, it takes an order of magnitude more cache lines per second to fully feed the AVX engine.
In those conditions, a single thread of vectorised code saturates almost fully the memory bus of your CPU. Adding a second thread might result in a very slight improvement. Adding further threads only results in contentions and added overhead. The only way to speed up such a calculation is to increase the memory bandwidth:
run on a NUMA system with more memory controllers and therefore higher aggregate memory bandwidth, e.g. a multisocket server board;
switch to a different architecture with much higher memory bandwidth, e.g. Intel Xeon Phi or a GPGPU.
From some researching on my own, it looks like your total CPU cache is around 8MB. 6000*4/4 (6000 floats split into blocks of 4) is 6MB. Multiply this by 2 (in and out), and you're outside of the cache.
I haven't tested this, but increasing the number of blocks should increase the performannce. Try 8 to start out with (your CPU siports hyperthreading to 8 virtual cores).
Currently, each of the different processes spawned on OpenMP is having cache conflicts and having to (re)load from main memory. Reducing the size of the blocks can help with this. Having distinct cahces would effectively increase the size of your cache, but it seems thats not an option.
If you're just doing this as a proof of principle, you may want to test this by running it on your graphics card. Although, that can be even harder to implement properly.
If you run with hyperthread enabled you should try the openmp version of ipp with 1 thread per core and set omp_places=cores if ipp doesn't do it automatically. If you use Cilk_ ipp try varying cilk_workers.
You might try a test case large enough to span multiple 4kb pages. Then additional factors come into play. Ideally, ipp will put the threads to work on different pages. On Linux (or Mac?) transparent huge pages should kick in. On Windows, haswell CPU introduced hardware page prefetch which should reduce but not eliminate importance of thp.

FFTW vs Matlab FFT

I posted this on matlab central but didn't get any responses so I figured I'd repost here.
I recently wrote a simple routine in Matlab that uses an FFT in a for-loop; the FFT dominates the calculations. I wrote the same routine in mex just for experimentation purposes and it calls the FFTW 3.3 library. It turns out that the matlab routine runs faster than the mex routine for very large arrays (about twice as fast). The mex routine uses wisdom and and performs the same FFT calculations. I also know matlab uses FFTW, but is it possible their version is slightly more optimized? I even used the FFTW_EXHAUSTIVE flag and its still about twice as slow for large arrays than the MATLAB counterpart. Furthermore I ensured the matlab I used was single threaded with the "-singleCompThread" flag and the mex file I used was not in debug mode. Just curious if this was the case - or if there are some optimizations matlab is using under the hood that I dont know about. Thanks.
Here's the mex portion:
void class_cg_toeplitz::analysis() {
// This method computes CG iterations using FFTs
// Check for wisdom
if(fftw_import_wisdom_from_filename("cd.wis") == 0) {
mexPrintf("wisdom not loaded.\n");
} else {
mexPrintf("wisdom loaded.\n");
}
// Set FFTW Plan - use interleaved FFTW
fftw_plan plan_forward_d_buffer;
fftw_plan plan_forward_A_vec;
fftw_plan plan_backward_Ad_buffer;
fftw_complex *A_vec_fft;
fftw_complex *d_buffer_fft;
A_vec_fft = fftw_alloc_complex(n);
d_buffer_fft = fftw_alloc_complex(n);
// CREATE MASTER PLAN - Do this on an empty vector as creating a plane
// with FFTW_MEASURE will erase the contents;
// Use d_buffer
// This is somewhat dangerous because Ad_buffer is a vector; but it does not
// get resized so &Ad_buffer[0] should work
plan_forward_d_buffer = fftw_plan_dft_r2c_1d(d_buffer.size(),&d_buffer[0],d_buffer_fft,FFTW_EXHAUSTIVE);
plan_forward_A_vec = fftw_plan_dft_r2c_1d(A_vec.height,A_vec.value,A_vec_fft,FFTW_WISDOM_ONLY);
// A_vec_fft.*d_buffer_fft will overwrite d_buffer_fft
plan_backward_Ad_buffer = fftw_plan_dft_c2r_1d(Ad_buffer.size(),d_buffer_fft,&Ad_buffer[0],FFTW_EXHAUSTIVE);
// Get A_vec_fft
fftw_execute(plan_forward_A_vec);
// Find initial direction - this is the initial residual
for (int i=0;i<n;i++) {
d_buffer[i] = b.value[i];
r_buffer[i] = b.value[i];
}
// Start CG iterations
norm_ro = norm(r_buffer);
double fft_reduction = (double)Ad_buffer.size(); // Must divide by size of vector because inverse FFT does not do this
while (norm(r_buffer)/norm_ro > relativeresidual_cutoff) {
// Find Ad - use fft
fftw_execute(plan_forward_d_buffer);
// Get A_vec_fft.*fft(d) - A_vec_fft is only real, but d_buffer_fft
// has complex elements; Overwrite d_buffer_fft
for (int i=0;i<n;i++) {
d_buffer_fft[i][0] = d_buffer_fft[i][0]*A_vec_fft[i][0]/fft_reduction;
d_buffer_fft[i][1] = d_buffer_fft[i][1]*A_vec_fft[i][0]/fft_reduction;
}
fftw_execute(plan_backward_Ad_buffer);
// Calculate r'*r
rtr_buffer = 0;
for (int i=0;i<n;i++) {
rtr_buffer = rtr_buffer + r_buffer[i]*r_buffer[i];
}
// Calculate alpha
alpha = 0;
for (int i=0;i<n;i++) {
alpha = alpha + d_buffer[i]*Ad_buffer[i];
}
alpha = rtr_buffer/alpha;
// Calculate new x
for (int i=0;i<n;i++) {
x[i] = x[i] + alpha*d_buffer[i];
}
// Calculate new residual
for (int i=0;i<n;i++) {
r_buffer[i] = r_buffer[i] - alpha*Ad_buffer[i];
}
// Calculate beta
beta = 0;
for (int i=0;i<n;i++) {
beta = beta + r_buffer[i]*r_buffer[i];
}
beta = beta/rtr_buffer;
// Calculate new direction vector
for (int i=0;i<n;i++) {
d_buffer[i] = r_buffer[i] + beta*d_buffer[i];
}
*total_counter = *total_counter+1;
if(*total_counter >= iteration_cutoff) {
// Set total_counter to -1, this indicates failure
*total_counter = -1;
break;
}
}
// Store Wisdom
fftw_export_wisdom_to_filename("cd.wis");
// Free fft alloc'd memory and plans
fftw_destroy_plan(plan_forward_d_buffer);
fftw_destroy_plan(plan_forward_A_vec);
fftw_destroy_plan(plan_backward_Ad_buffer);
fftw_free(A_vec_fft);
fftw_free(d_buffer_fft);
};
Here's the matlab portion:
% Take FFT of A_vec.
A_vec_fft = fft(A_vec); % Take fft once
% Find initial direction - this is the initial residual
x = zeros(n,1); % search direction
r = zeros(n,1); % residual
d = zeros(n+(n-2),1); % search direction; pad to allow FFT
for i = 1:n
d(i) = b(i);
r(i) = b(i);
end
% Enter CG iterations
total_counter = 0;
rtr_buffer = 0;
alpha = 0;
beta = 0;
Ad_buffer = zeros(n+(n-2),1); % This holds the product of A*d - calculate this once per iteration and using FFT; only 1:n is used
norm_ro = norm(r);
while(norm(r)/norm_ro > 10^-6)
% Find Ad - use fft
Ad_buffer = ifft(A_vec_fft.*fft(d));
% Calculate rtr_buffer
rtr_buffer = r'*r;
% Calculate alpha
alpha = rtr_buffer/(d(1:n)'*Ad_buffer(1:n));
% Calculate new x
x = x + alpha*d(1:n);
% Calculate new residual
r = r - alpha*Ad_buffer(1:n);
% Calculate beta
beta = r'*r/(rtr_buffer);
% Calculate new direction vector
d(1:n) = r + beta*d(1:n);
% Update counter
total_counter = total_counter+1;
end
In terms of time, for N = 50000 and b = 1:n it takes about 10.5 seconds with mex and 4.4 seconds with matlab. I'm using R2011b. Thanks
A few observations rather than a definite answer since I do not know any of the specifics of the MATLAB FFT implementation:
Based on the code you have, I can see two explanations for the speed difference:
the speed difference is explained by differences in levels of optimization of the FFT
the while loop in MATLAB is executed a significantly smaller number of times
I will assume you already looked into the second issue and that the number of iterations are comparable. (If they aren't, this is most likely to some accuracy issues and worth further investigations.)
Now, regarding FFT speed comparison:
Yes, the theory is that FFTW is faster than other high-level FFT implementations but it is only relevant as long as you compare apples to apples: here you are comparing implementations at a level further down, at the assembly level, where not only the selection of the algorithm but its actual optimization for a specific processor and by software developers with varying skills comes at play
I have optimized or reviewed optimized FFTs in assembly on many processors over the year (I was in the benchmarking industry) and great algorithms are only part of the story. There are considerations that are very specific to the architecture you are coding for (accounting for latencies, scheduling of instructions, optimization of register usage, arrangement of data in memory, accounting for branch taken/not taken latencies, etc.) and that make differences as important as the selection of the algorithm.
With N=500000, we are also talking about large memory buffers: yet another door for more optimizations that can quickly get pretty specific to the platform you run your code on: how well you manage to avoid cache misses won't be dictated by the algorithm so much as by how the data flow and what optimizations a software developer may have used to bring data in and out of memory efficiently.
Though I do not know the details of the MATLAB FFT implementation, I am pretty sure that an army of DSP engineers has been (and is still) honing on its optimization as it is key to so many designs. This could very well mean that MATLAB had the right combination of developers to produce a much faster FFT.
This is classic performance gain thanks to low-level and architecture-specific optimization.
Matlab uses FFT from the Intel MKL (Math Kernel Library) binary (mkl.dll). These are routines optimized (at assembly level) by Intel for Intel processors. Even on AMD's it seems to give nice performance boosts.
FFTW seems like a normal c library that is not as optimized. Hence the performance gain to use the MKL.
I have found the following comment on the MathWorks website [1]:
Note on large powers of 2: For FFT dimensions that are powers of
2, between 2^14 and 2^22, MATLAB software uses special preloaded
information in its internal database to optimize the FFT computation.
No tuning is performed when the dimension of the FTT is a power of 2,
unless you clear the database using the command fftw('wisdom', []).
Although it relates to powers of 2, it may hint upon that MATLAB employs its own 'special wisdom' when using FFTW for certain (large) array sizes. Consider: 2^16 = 65536.
[1] R2013b Documentation available from http://www.mathworks.de/de/help/matlab/ref/fftw.html (accessed on 29 Oct 2013)
EDIT: #wakjah 's reply to this answer is accurate: FFTW does support split real and imaginary memory storage via its Guru interface. My claim about hacking is thus not accurate but can very well apply if FFTW's Guru interface is not used - which is the case by default, so beware still!
First, sorry for being a year late. I'm not convinced that the speed increase you see comes from MKL or other optimizations. There is something quite fundamentally different between FFTW and Matlab, and that is how complex data is stored in memory.
In Matlab, the real and imaginary parts of a complex vector X are separate arrays Xre[i] and Xim[i] (linear in memory, efficient when operating on either of them separately).
In FFTW, the real and imaginary parts are interlaced as double[2] by default, i.e. X[i][0] is the real part, and X[i][1] is the imaginary part.
Thus, to use the FFTW library in mex files one cannot use the Matlab array directly, but must allocate new memory first, then pack the input from Matlab into FFTW format, and then unpack the output from FFTW into Matlab format. i.e.
X = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
Y = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
then
for (size_t i=0; i<N; ++i) {
X[i][0] = Xre[i];
X[i][1] = Xim[i];
}
then
for (size_t i=0; i<N; ++i) {
Yre[i] = Y[i][0];
Yim[i] = Y[i][1];
}
Hence, this requires 2x memory allocations + 4x reads + 4x writes -- all of size N. This does take a toll speed-wise on large problems.
I have a hunch that Mathworks may have hacked the FFTW3 code to allow it to read input vectors directly in the Matlab format, which avoids all of the above.
In this scenario, one can only allocate X and use X for Y to run FFTW in-place (as fftw_plan_*(N, X, X, ...) instead of fftw_plan_*(N, X, Y, ...)), since it'll be copied to the Yre and Yim Matlab vector, unless the application requires/benefits from keeping X and Y separate.
EDIT: Looking at the memory consumption in real-time when running Matlab's fft2() and my code based on the fftw3 library, it shows that Matlab only allocates only one additional complex array (the output), whereas my code needs two such arrays (the *fftw_complex buffer plus the Matlab output). An in-place conversion between the Matlab and fftw formats is not possible because the Matlab's real and imaginary arrays are not consecutive in memory. This suggests that Mathworks hacked the fftw3 library to read/write the data using the Matlab format.
One other optimization for multiple calls, is to allocate persistently (using mexMakeMemoryPersistent()). I'm not sure if the Matlab implementation does this as well.
Cheers.
p.s. As a side note, the Matlab complex data storage format is more efficient for operating on the real or imaginary vectors separately. On FFTW's format you'd have to do ++2 memory reads.

Optimization tips for a cuda code

I wrote a piece of code for computing Self Quotient Image (SQI) in MATLAB. And now i want to rewrite a part of it in parallel for speedup.
this part of code is:
siz=15;
X=normalize8(X);
[a,b]=size(X);
filt = fspecial('gaussian',[siz siz],sigma);
padsize = floor(siz/2);
padX = padarray(X,[padsize, padsize],'symmetric','both');
t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
for j=padsize+1:b+padsize
region = padX(i-padsize:i+padsize, j-padsize:j+padsize);
means= mean(region(:));
M=return_step(region, means);
filt1=filt.*M;
summ=sum(sum(filt1));
filt1=(filt1/summ);
Z(i-padsize,j-padsize)=(sum(sum(filt1.*region))/(siz*siz));
end
end
toc(t0) % -------------------------------------------------------------
and return_step function:
function M=return_step(X, means)
[a,b]=size(X);
for i=1:a
for j=1:b
if X(i,j)>=means
M(i,j)=1;
end
end
end
I wrote below kernel function:
__global__ void returnstep(const double* x, double* m, double* filt, int leng, double mean, int i, int j, int width)
{
int idx=threadIdx.y*blockDim.x+threadIdx.x;
if(idx>=leng) return;
int ridx= (j+threadIdx.y)*width+threadIdx.x+i;
double xval= x[ridx];
if (xval>=mean) m[idx]=filt[idx]*xval;
else m[idx]=0;
}
and then changed the MATLAB code as follow:
kernel= parallel.gpu.CUDAKernel('returnstep.ptx', 'returnstep.cu');
kernel.ThreadBlockSize= [double(siz) double(siz) 1];
GM = gpuArray(zeros(siz,siz));
GpadX = gpuArray(padX);
Gfilt = gpuArray(filt);
%% Process image
t0 = tic; % -------------------------------------------------------------
Z=zeros(a,b);
for i=padsize+1:a+padsize
for j=padsize+1:b+padsize
means= mean(region(:));
GM= feval(kernel, GpadX, GM, Gfilt, siz*siz, means, i-padsize-1, j-padsize-1, padXwidth);
filt1= gather(GM);
summ=sum(sum(filt1));
filt1=(filt1/summ);
Z(i-padsize,j-padsize)=(sum(sum(filt1))/(siz*siz));
end
end
toc(t0) % -------------------------------------------------------------
my sequential code runs in 2.5s for a 330X200 image but the new parallel code's run time is 15s. I don't know why????
I need some advise for improving it. I am new in CUDA programming.
> help gather
...
X = GATHER(A) when A is a GPUArray, X is an array in the local workspace
with the data transferred from the GPU device.
....
filt1 = gather(GM) is copying GM from the GPU to the CPU in every step, which is very inefficient. You should move the entire computation inside the loop nest, or preferably the entire loop nest to the GPU kernel. Otherwise you can forget about any speedup.
My evaluation under Sobel filter shows the CPU outperforms GPU on small images. I think your image size is so small for comparison of CPU-GPU performance. Computation should be large enough to hide kernel and communication launch overhead.

CUDA Thrust slow when operating large vectors on my machine

I'm a CUDA beginner and reading on some thrust tutorials.I write a simple but terribly organized code and try to figure out the acceleration of thrust.(is this idea correct?). I try to add two vectors (with 10000000 int) to another vector, by adding array on cpu and adding device_vector on gpu.
Here is the thing:
#include <iostream>
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#define N 10000000
int main(void)
{
float time_cpu;
float time_gpu;
int *a = new int[N];
int *b = new int[N];
int *c = new int[N];
for(int i=0;i<N;i++)
{
a[i]=i;
b[i]=i*i;
}
clock_t start_cpu,stop_cpu;
start_cpu=clock();
for(int i=0;i<N;i++)
{
c[i]=a[i]+b[i];
}
stop_cpu=clock();
time_cpu=(double)(stop_cpu-start_cpu)/CLOCKS_PER_SEC*1000;
std::cout<<"Time to generate (CPU):"<<time_cpu<<std::endl;
thrust::device_vector<int> X(N);
thrust::device_vector<int> Y(N);
thrust::device_vector<int> Z(N);
for(int i=0;i<N;i++)
{
X[i]=i;
Y[i]=i*i;
}
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);
thrust::transform(X.begin(), X.end(),
Y.begin(),
Z.begin(),
thrust::plus<int>());
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime,start,stop);
std::cout<<"Time to generate (thrust):"<<elapsedTime<<std::endl;
cudaEventDestroy(start);
cudaEventDestroy(stop);
getchar();
return 0;
}
The CPU results appear really fast, But gpu runs REALLY slow on my machine(i5-2320,4G,GTX 560 Ti), CPU time is about 26,GPU time is around 30! Did I just do the thrust wrong with stupid errors in my code? or was there a deeper reason?
As a C++ rookie, I checked my code over and over and still got a slower time on GPU with thrust, so I did some experiments to show the difference of calculating vectorAdd with five different approaches.
I use windows API QueryPerformanceFrequency() as unified time measurement method.
Each of the experiments looks like this:
f = large_interger.QuadPart;
QueryPerformanceCounter(&large_interger);
c1 = large_interger.QuadPart;
for(int j=0;j<10;j++)
{
for(int i=0;i<N;i++)//CPU array adding
{
c[i]=a[i]+b[i];
}
}
QueryPerformanceCounter(&large_interger);
c2 = large_interger.QuadPart;
printf("Time to generate (CPU array adding) %lf ms\n", (c2 - c1) * 1000 / f);
and here is my simple __global__ function for GPU array adding:
__global__ void add(int *a, int *b, int *c)
{
int tid=threadIdx.x+blockIdx.x*blockDim.x;
while(tid<N)
{
c[tid]=a[tid]+b[tid];
tid+=blockDim.x*gridDim.x;
}
}
and the function is called as:
for(int j=0;j<10;j++)
{
add<<<(N+127)/128,128>>>(dev_a,dev_b,dev_c);//GPU array adding
}
I add vector a[N] and b[N] to vector c[N] for a loop of 10 times by:
add array on CPU
add std::vector on CPU
add thrust::host_vector on CPU
add thrust::device_vector on GPU
add array on GPU. and here is the result
with N=10000000
and I get results:
CPU array adding 268.992968ms
CPU std::vector adding 1908.013595ms
CPU Thrust::host_vector adding 10776.456803ms
GPU Thrust::device_vector adding 297.156610ms
GPU array adding 5.210573ms
And this confused me, I'm not familiar with the implementation of template library. Did the performance really differs so much between containers and raw data structures?
Most of the execution time is being spent in your loop that is initializing X[i] and Y[i]. While this is legal, it's a very slow way to initialize large device vectors. It would be better to create host vectors, initialize them, then copy those to the device. As a test, modify your code like this (right after the loop where you are initializing the device vectors X[i] and Y[i]):
} // this is your line of code
std::cout<< "Starting GPU run" <<std::endl; //add this line
cudaEvent_t start, stop; //this is your line of code
You will then see that the GPU timing results appear almost immediately after that added line prints out. So all of the time you're waiting is spent in initializing those device vectors directly from host code.
When I run this on my laptop, I get a CPU time of about 40 and a GPU time of about 5, so the GPU is running about 8 times faster than the CPU for the sections of code you are actually timing.
If you create X and Y as host vectors, and then create analogous d_X and d_Y device vectors, the overall execution time will be shorter, like so:
thrust::host_vector<int> X(N);
thrust::host_vector<int> Y(N);
thrust::device_vector<int> Z(N);
for(int i=0;i<N;i++)
{
X[i]=i;
Y[i]=i*i;
}
thrust::device_vector<int> d_X = X;
thrust::device_vector<int> d_Y = Y;
and change your transform call to:
thrust::transform(d_X.begin(), d_X.end(),
d_Y.begin(),
Z.begin(),
thrust::plus<int>());
OK so you've now indicated that the CPU run measurement is faster than the GPU measurement. Sorry I jumped to conclusions. My laptop is an HP laptop with a 2.6GHz core i7 and a Quadro 1000M gpu. I'm running centos 6.2 linux. A few comments: if you're running any heavy display tasks on your GPU, that can detract from performance. Also, when benchmarking these things it's common practice to use the same mechanism for comparison, you can use cudaEvents for both if you want, it can time CPU code the same as GPU code. Also, it's common practice with thrust to do a warm up run that is untimed, then repeat the test for a measurement, and likewise it's common practice to run the test 10 times or more in a loop, then divide to get an average. In my case, I can tell the clocks() measurement is pretty coarse because successive runs will give me 30, 40 or 50. On the GPU measurement I get something like 5.18256. Some of these things may help, but I can't say exactly why your results and mine differ so much (on the GPU side).
OK I did another experiment. The compiler will make a big difference on CPU side. I compiled with -O3 switch and the CPU time dropped to 0. Then I converted the CPU timing measurement from the clocks() method to cudaEvents, and I got a CPU measured time of 12.4 (with -O3 optimization) and still 5.1 on GPU side.
Your mileage will vary based on timing method and which compiler you are using on the CPU side.
First, Y[i]=i*i; does not fit in an integer for 10M elements. Integers holds roughly 1e10 and your code needs 1e14.
Second, it looks like the timing of transform is correct and should be faster than the CPU, regardless of which library you're using. Robert's suggestion to initialize vectors on CPU and then transfer to GPU is a good one for this case.
Third, since we can't do the integer multiple, below is some simpler CUDA library code (using ArrayFire that I work on) to do similar with floats, for your benchmarking:
int n = 10e6;
array x = array(seq(n));
array y = x * x;
timer t = timer::tic();
array z = x + y;
af::eval(z); af::sync();
printf("elapsed seconds: %g\n", timer::toc( t));
Good luck!
I am running similar test recently using CUDA Thrust on my Quadro 1000m. I use the thrust::sort_by_key as a benchmark to test its performance and the result is too good to convince my boos.It takes 100+ms to sort 512MB pairs.
For your problem, I am confused for 2 things.
(1) Why you multiple this time_cpu by 1000? Without the 1000, it is already in seconds.
time_cpu=(double)(stop_cpu-start_cpu)/CLOCKS_PER_SEC*1000;
(2) And, by mentioning 26, 30, 40, do you mean seconds or ms? The 'cudaEvent' report elapsed time in 'ms' not 's'.