I am trying to learn programming a GPU. My system environment is as follows:
OS: windows 10 pro
GPU: NVIDIA GTX 1080 Ti (display does not run on this; there is another gpu for that)
CUDA toolkit: v9.1
I wrote this simple program using CUDA to calculate FFT from scratch on a GPU. The algorithm follows the wikipedia example of Cooley-Tukey algorithm. The code uses recursive functions to calculate the FFT of an array of complex values.
#include <iostream>
#include <string>
#include "conio.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust\complex.h>
#include <cstdio>
#include <fstream>
using namespace std;
#define winSize 2048
#define winShift 1024
#define M_PI 3.14159265358979323846
__device__ void separate(thrust::complex<double>* a, int n)
{
thrust::complex<double>* b = new thrust::complex<double>[n / 2]; // get temp heap storage
for (int i = 0; i<n / 2; i++) // copy all odd elements to heap storage
b[i] = a[i * 2 + 1];
for (int i = 0; i<n / 2; i++) // copy all even elements to lower-half of a[]
a[i] = a[i * 2];
for (int i = 0; i<n / 2; i++) // copy all odd (from heap) to upper-half of a[]
a[i + n / 2] = b[i];
delete[] b; // release heap storage (allocated with new[], so cudaFree does not apply)
}
// N must be a power-of-2, or bad things will happen.
// Currently no check for this condition.
//
// N input samples in X[] are FFT'd and results left in X[].
// Because of Nyquist theorem, N samples means
// only first N/2 FFT results in X[] are the answer.
// (upper half of X[] is a reflection with no new information).
__global__ void fft2(thrust::complex<double>* X, int N)
{
if (N < 2)
{
// bottom of recursion.
// Do nothing here, because already X[0] = x[0]
}
else
{
separate(X, N); // all evens to lower half, all odds to upper half
fft2 << <1, 1 >> >(X, N / 2); // recurse even items
fft2 << <1, 1 >> >(X + N / 2, N / 2); // recurse odd items
// combine results of two half recursions
for (int k = 0; k<N / 2; k++)
{
thrust::complex<double> e = X[k]; // even
thrust::complex<double> o = X[k + N / 2]; // odd
// w is the "twiddle-factor"
thrust::complex<double> w = exp(thrust::complex<double>(0, -2.*M_PI*k / N));
X[k] = e + w * o;
X[k + N / 2] = e - w * o;
}
}
}
int main()
{
const int nSamples = 64;
double nSeconds = 0.02; // total time for sampling
double sampleRate = nSamples / nSeconds; // n Hz = n / second
double freqResolution = sampleRate / nSamples; // freq step in FFT result
thrust::complex<double> x[nSamples]; // storage for sample data
thrust::complex<double> X[nSamples]; // storage for FFT answer
thrust::complex<double> *d_arr1;
const int nFreqs = 5;
double freq[nFreqs] = { 2,4,8,32,72 }; // known freqs for testing
size_t n_byte = nSamples * sizeof(thrust::complex<double>);
// generate samples for testing
for (int i = 0; i<nSamples; i++)
{
x[i] = thrust::complex<double>(0., 0.);
// sum several known sinusoids into x[]
for (int j = 0; j < nFreqs; j++)
x[i] += sin(2 * M_PI*freq[j] * i); // / nSamples);
X[i] = x[i]; // copy into X[] for FFT work & result
}
// compute fft for this data
cudaMalloc((void**)&d_arr1, n_byte);
cudaMemcpy(d_arr1, X, n_byte, cudaMemcpyHostToDevice);
//launchKernel << <1, 1 >> >(d_arr1, nSamples);
fft2 << <1, 1 >> > (d_arr1, nSamples);
cudaMemcpy(X, d_arr1, n_byte, cudaMemcpyDeviceToHost);
printf(" n\tx[]\tX[]\tf\n"); // header line
// loop to print values
for (int i = 0; i<nSamples; i++)
{
printf("% 3d\t%+.3f\t%+.3f\t%g\n",
i, x[i].real(), abs(X[i]), i*freqResolution);
}
ofstream myfile("example_cuda.txt");
printf("I am trying to write to file\n");
if (myfile.is_open())
{
for (int count = 0; count < nSamples; count++)
{
myfile << x[count].real() << "," << abs(X[count]) << "," << count*freqResolution << "\n";
}
myfile.close();
}
}
I used the following command to compile the code using VS2015 command prompt:
nvcc -o fft_Wiki2.exe -c -arch=compute_35 -rdc=true
--expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.cu
The compilation itself doesn't show any errors or warnings, but the executable does not run. When I try the
fft_Wiki2.exe
it simply says the version of this executable is incompatible with the 64 bit Windows version and so cannot execute. But I am using the --machine 64 option to force the executable version.
How do I get this program to execute ?
How do I get this program to execute ?
It isn't a program you are trying to run, it is an object file.
In your compilation command you pass -c:
nvcc -o fft_Wiki2.exe -c -arch=compute_35 -rdc=true --expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.cu
which means only compilation and no linking. What you would need to do is something like this:
nvcc -o fft_Wiki2.obj -c -arch=compute_35 -rdc=true --expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.cu
nvcc -o fft_Wiki2.exe -arch=compute_35 --expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.obj
[Note I don't have access to a Windows development platform to check the accuracy of the commands]
The first command compiles and emits an object file. The second performs both host and device code linking and emits an executable, which you should be able to run.
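Once the executable links and runs, it may help to check the GPU output against a host-side reference. Below is a minimal sketch of the same recursive Cooley-Tukey scheme in plain C++. The name fft_host is hypothetical (it does not appear in the question), and the sketch assumes the input length is a power of two.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: recursive radix-2 Cooley-Tukey FFT.
// x.size() must be a power of two; x is transformed in place.
void fft_host(std::vector<std::complex<double>>& x) {
    const std::size_t n = x.size();
    if (n < 2) return; // bottom of recursion: X[0] = x[0]

    // Split into even-indexed and odd-indexed samples.
    std::vector<std::complex<double>> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }

    // Both half-size transforms finish before the combine step starts.
    fft_host(even);
    fft_host(odd);

    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < n / 2; ++k) {
        // w is the twiddle factor exp(-2*pi*i*k/n).
        const std::complex<double> w =
            std::polar(1.0, -2.0 * pi * static_cast<double>(k) / static_cast<double>(n));
        x[k] = even[k] + w * odd[k];
        x[k + n / 2] = even[k] - w * odd[k];
    }
}
```

Comparing abs(X[i]) from the GPU run with the output of this function on the same samples should show quickly whether the device code computes what is intended.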
I am trying to use OpenACC in Windows. I am using GCC to compile. (with version 8.1.0)
I found a sample code online using OpenACC.
So using the command prompt, I typed as follows.
"C:\Users\chang>g++ -fopenacc -o C:\Users\chang\source\repos\Project18\Project18\testing.exe C:\Users\chang\source\repos\Project18\Project18\Source1.cpp"
And if I look at Performance in Task manager while the code is running, I don't see any change in GPU usage.
Also if I skip -fopenacc
"C:\Users\chang>g++ -o C:\Users\chang\source\repos\Project18\Project18\testing.exe C:\Users\chang\source\repos\Project18\Project18\Source1.cpp"
There is no difference in speed between with -fopenacc and without.
So I was wondering if there is a prerequisite before I use this OpenACC.
Below is the sample code I found.
Thanks in advance.
P.S
As far as I remember, I haven't downloaded openacc.h, and I tried to find it online but couldn't. Could this be a problem? Since I could build and run the exe file, it doesn't seem to be, but I'm asking just in case.
/*
* Copyright 2012 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <iostream>
#include <math.h>
#include <string.h>
#include <openacc.h>
#include <chrono>
#define NN 4096
#define NM 4096
using namespace std;
using namespace chrono;
double A[NN][NM];
double Anew[NN][NM];
int main(int argc, char** argv)
{
const int n = NN;
const int m = NM;
const int iter_max = 1000;
const double tol = 1.0e-6;
double error = 1.0;
memset(A, 0, n * m * sizeof(double));
memset(Anew, 0, n * m * sizeof(double));
for (int j = 0; j < n; j++)
{
A[j][0] = 1.0;
Anew[j][0] = 1.0;
}
printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
system_clock::time_point start = system_clock::now();
int iter = 0;
#pragma acc data copy(A), create(Anew)
while (error > tol && iter < iter_max)
{
error = 0.0;
#pragma acc kernels
for (int j = 1; j < n - 1; j++)
{
for (int i = 1; i < m - 1; i++)
{
Anew[j][i] = 0.25 * (A[j][i + 1] + A[j][i - 1]
+ A[j - 1][i] + A[j + 1][i]);
error = fmax(error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma acc kernels
for (int j = 1; j < n - 1; j++)
{
for (int i = 1; i < m - 1; i++)
{
A[j][i] = Anew[j][i];
}
}
if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
iter++;
}
system_clock::time_point end = system_clock::now();
std::chrono::duration<float> sec = end - start;
cout << sec.count() << endl;
}
At this time, GCC doesn't support GPU code offloading on Windows. See https://stackoverflow.com/a/59376314/664214, or http://mid.mail-archive.com/87d08zjlbd.fsf@euler.schwinge.homeip.net, for example. It's certainly possible to implement, but somebody needs to do it, or pay for the work.
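To see from inside a program whether an OpenACC build has any accelerator to offload to, a small check along these lines might help. It is a sketch, not a definitive diagnostic: the function name offload_device_count is hypothetical, and the code assumes the compiler defines the standard _OPENACC macro when OpenACC is enabled and that the runtime provides acc_get_num_devices from openacc.h.

```cpp
#include <cassert>
#ifdef _OPENACC
#include <openacc.h> // only available when the compiler ships an OpenACC runtime
#endif

// Hypothetical helper: number of non-host devices the OpenACC runtime can see.
// Returns 0 when OpenACC is disabled or when only the host is available,
// in which case the "acc" regions simply run on the CPU.
int offload_device_count() {
#ifdef _OPENACC
    return acc_get_num_devices(acc_device_not_host);
#else
    return 0;
#endif
}
```

If this reports 0 in a build made with -fopenacc, the directives are accepted but the loops still execute on the host, which would match the unchanged GPU usage and timing described in the question.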
I am a newbie in OpenCL and currently have some questions about its performance.
I have Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz + ubuntu + Beignet (Intel open source openCL library see: http://arrayfire.com/opencl-on-intel-hd-iris-graphics-on-linux/ http://www.freedesktop.org/wiki/Software/Beignet/)
I have a simple benchmark:
#define __CL_ENABLE_EXCEPTIONS
#include "CL/cl.hpp"
#include <vector>
#include <iostream>
#include <iterator>
#include <algorithm>
using namespace cl;
using namespace std;
void CPUadd(vector<float> & A, vector<float> & B, vector<float> & C)
{
for (int i = 0; i < A.size(); i++)
{
C[i] = A[i] + B[i];
}
}
int main(int argc, char* argv[]) {
Context(CL_DEVICE_TYPE_GPU);
static const unsigned elements = 1000000;
vector<float> data(elements, 6);
Buffer a(begin(data), end(data), true, false);
Buffer b(begin(data), end(data), true, false);
Buffer c(CL_MEM_READ_WRITE, elements * sizeof(float));
Program addProg(R"d(
kernel
void add( global const float * restrict const a,
global const float * restrict const b,
global float * restrict const c) {
unsigned idx = get_global_id(0);
c[idx] = a[idx] + b[idx] + a[idx] * b[idx] + 5;
}
)d", true);
auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
#if 1
for (int i = 0; i < 4000; i++)
{
add(EnqueueArgs(elements), a, b, c);
}
vector<float> result(elements);
cl::copy(c, begin(result), end(result));
#else
vector<float> result(elements);
for (int i = 0; i < 4000; i++)
{
CPUadd(data, data, result);
}
#endif
//std::copy(begin(result), end(result), ostream_iterator<float>(cout, ", "));
}
According to my measurements, the Intel HD GPU is 20x faster than a single CPU core (see the benchmark above). That seems too small to me, because if I used all 4 cores, the GPU would be only 5x faster. Did I write the benchmark correctly, and does the speed-up seem realistic? Unfortunately, clinfo does not find the CPU as an OpenCL device in my case, so I can't do a direct comparison.
UPDATE
Measurements
$ g++ -o main main.cpp -lOpenCL -std=c++11
$ time ./main
real 0m37.316s
user 0m37.280s
sys 0m0.016s
$ g++ -o main main.cpp -lOpenCL -std=c++11
$ time ./main
real 0m2.349s
user 0m0.524s
sys 0m0.624s
Total: 2.349 - 0.524 = 1.825 for GPU
37.316 - 0.524 = 36.792 for CPU
36.792 / 1.825 = 20.16x faster than single CPU => 5x faster than full CPU.
The two implementations you are comparing are not functionally equivalent.
Your CPU implementation needs 30% less memory bandwidth (which may explain the performance). It accesses only arrays A and B, while the GPU kernel uses 3 arrays: a, b and c.
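For a like-for-like comparison, the CPU loop would need to perform the same arithmetic as the kernel and be timed on its own rather than through the whole process. A minimal sketch follows; the names cpu_add_equivalent and time_cpu_seconds are hypothetical, and the expression is copied from the kernel in the question.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Same per-element arithmetic as the OpenCL kernel in the question:
// c[idx] = a[idx] + b[idx] + a[idx] * b[idx] + 5
void cpu_add_equivalent(const std::vector<float>& a,
                        const std::vector<float>& b,
                        std::vector<float>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i] + a[i] * b[i] + 5;
    }
}

// Times only the compute loop, excluding allocation and process start-up.
double time_cpu_seconds(std::size_t elements, int repeats) {
    std::vector<float> data(elements, 6.0f);
    std::vector<float> result(elements);
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        cpu_add_equivalent(data, data, result);
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling time_cpu_seconds(1000000, 4000) would mirror the sizes used in the question. An optimizing compiler may notice that the repeated calls produce identical results, so the measured time is best treated as a rough figure.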
I wrote a number crunching algorithm. The idea is that:
A small main program needs very little memory (it starts at 2 MB).
Then, in a loop, it calls a function that needs quite a lot of memory (around 100 MB), which should be released when the function ends. In order to understand what's going on, the function is now always called with the same parameters.
It seems that the program slowly eats memory so I suspect a memory leak. I have tried Address Sanitizer from Clang and Pointer Checker from Intel but they don't find anything.
Now, I am looking at the memory consumption in my Activity Monitor (I am running OSX, but I get the same memory usage from the Unix command "top") and just before the big function is called, the program takes 2 MB. When running the function, the program takes 120 MB. What is strange is that when the program leaves the big function and comes back inside the loop, it now takes 37 MB! Then, when it goes back into the big function, it takes 130 MB. Again, coming back in the loop, it takes 36 MB, then in the big function it takes 140 MB...
So it is slowly drifting away, but not with a regular pattern. How should I trust the memory usage in "top"?
Can memory fragmentation increase the memory usage without memory leak?
I let the program run overnight, and here is the data I get:
In the first loop, the program takes 150 MB
2 hours later, after 68 loops, the program takes 220 MB
After one night and 394 loops, the program takes 480 MB
So the function that allocates and deallocates memory (about 120 MB) appears to "leak" about 1 MB each time it is called.
First, make sure that the growth continues over a long period of time (for example, if one iteration takes a minute, run for a couple of hours). If the growth levels off asymptotically, then there's no problem. Next, I would try valgrind. Then, if that doesn't help, you'll have to binary search your code: comment out bits until the growth stops. I would start by totally removing use of the MKL library (leave stubs if you want to) and see what happens. Next, change your vector to std::vector just to see if that helps. After that, you'll have to use your judgment.
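To check whether the growth really continues, it can also help to log memory from inside the process after each loop iteration instead of reading it off "top". The sketch below assumes a POSIX system with getrusage; the helper name peak_rss is hypothetical. Note that ru_maxrss is a high-water mark, and its unit differs by platform (kilobytes on Linux, bytes on macOS).

```cpp
#include <cassert>
#include <sys/resource.h> // POSIX only

// Hypothetical helper: peak resident set size of the current process.
// Unit is platform-dependent: kilobytes on Linux, bytes on macOS.
// Returns -1 if the query fails.
long peak_rss() {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1;
    }
    return usage.ru_maxrss;
}
```

Printing peak_rss() once per iteration gives a series that either flattens out (no problem) or keeps climbing, which is easier to judge than occasional snapshots.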
I think that I have found the culprit: the MKL (the latest version as of today). I use Pardiso, and the following example leaks very slowly: about 0.1 MB every 13 seconds which leads to 280 MB overnight. These are the numbers I get from my simulation.
If you want to give it a try, you can compile it with:
icpc -std=c++11 pardiso-leak.cpp -o main -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -ldl -lpthread -lm
Thanks everyone for your help. I have reported the bug to Intel.
#include <iostream>
#include <vector>
#include "mkl_pardiso.h"
#include "mkl_types.h"
int main (int argc, char const *argv[])
{
const auto n = std::size_t{1000};
auto m = MKL_INT{n * n};
auto values = std::vector<double>();
auto column = std::vector<MKL_INT>();
auto row = std::vector<MKL_INT>();
row.push_back(1);
for(std::size_t j = 0; j < n; ++j) {
column.push_back(j + 1);
values.push_back(1.0);
column.push_back(j + n + 1);
values.push_back(0.1);
row.push_back(column.size() + 1);
}
for(std::size_t i = 1; i < n - 1; ++i) {
for(std::size_t j = 0; j < n; ++j) {
column.push_back(n * i + j - n + 1);
values.push_back(0.1);
column.push_back(n * i + j + 1);
values.push_back(1.0);
column.push_back(n * i + j + n + 1);
values.push_back(0.1);
row.push_back(column.size() + 1);
}
}
for(std::size_t j = 0; j < n; ++j) {
column.push_back((n - 1) * n + j - n + 1);
values.push_back(0.1);
column.push_back((n - 1) * n + j + 1);
values.push_back(1.0);
row.push_back(column.size() + 1);
}
auto y = std::vector<double>(m, 1.0);
auto x = std::vector<double>(m, 0.0);
auto pardiso_nrhs = MKL_INT{1};
auto pardiso_max_fact = MKL_INT{1};
auto pardiso_mnum = MKL_INT{1};
auto pardiso_mtype = MKL_INT{11};
auto pardiso_msglvl = MKL_INT{0};
MKL_INT pardiso_iparm[64];
for (int i = 0; i < 64; ++i) {
pardiso_iparm[i] = 0;
}
pardiso_iparm[0] = 1;
pardiso_iparm[1] = 2;
pardiso_iparm[3] = 0;
pardiso_iparm[4] = 0;
pardiso_iparm[5] = 0;
pardiso_iparm[7] = 0;
pardiso_iparm[8] = 0;
pardiso_iparm[9] = 13;
pardiso_iparm[10] = 1;
pardiso_iparm[11] = 0;
pardiso_iparm[12] = 1;
pardiso_iparm[17] = -1;
pardiso_iparm[18] = 0;
pardiso_iparm[20] = 0;
pardiso_iparm[23] = 1;
pardiso_iparm[24] = 0;
pardiso_iparm[26] = 0;
pardiso_iparm[27] = 0;
pardiso_iparm[30] = 0;
pardiso_iparm[31] = 0;
pardiso_iparm[32] = 0;
pardiso_iparm[33] = 0;
pardiso_iparm[34] = 0;
pardiso_iparm[59] = 0;
pardiso_iparm[60] = 0;
pardiso_iparm[61] = 0;
pardiso_iparm[62] = 0;
pardiso_iparm[63] = 0;
void* pardiso_pt[64];
for (int i = 0; i < 64; ++i) {
pardiso_pt[i] = nullptr;
}
auto error = MKL_INT{0};
auto phase = MKL_INT{11};
MKL_INT i_dummy;
double d_dummy;
PARDISO(pardiso_pt, &pardiso_max_fact, &pardiso_mnum, &pardiso_mtype,
&phase, &m, values.data(), row.data(), column.data(), &i_dummy,
&pardiso_nrhs, pardiso_iparm, &pardiso_msglvl, &d_dummy,
&d_dummy, &error);
phase = 22;
PARDISO(pardiso_pt, &pardiso_max_fact, &pardiso_mnum, &pardiso_mtype,
&phase, &m, values.data(), row.data(), column.data(), &i_dummy,
&pardiso_nrhs, pardiso_iparm, &pardiso_msglvl, &d_dummy,
&d_dummy, &error);
phase = 33;
for(size_t i = 0; i < 10000; ++i) {
std::cout << "i = " << i << std::endl;
PARDISO(pardiso_pt, &pardiso_max_fact, &pardiso_mnum, &pardiso_mtype,
&phase, &m, values.data(), row.data(), column.data(), &i_dummy,
&pardiso_nrhs, pardiso_iparm, &pardiso_msglvl, y.data(),
x.data(), &error);
}
phase = -1;
PARDISO(pardiso_pt, &pardiso_max_fact, &pardiso_mnum, &pardiso_mtype,
&phase, &m, values.data(), row.data(), column.data(), &i_dummy,
&pardiso_nrhs, pardiso_iparm, &pardiso_msglvl, &d_dummy,
&d_dummy, &error);
return 0;
}
I need a fast and efficient implementation for finding the index of the maximum value in an array in CUDA. This operation needs to be performed several times. I originally used cublasIsamax for this, however, it sadly returns the index of the maximum absolute value, which is not what I want. Instead, I'm using thrust::max_element, however the speed is rather slow in comparison to cublasIsamax. I use it in the following manner:
//d_vector is a pointer on the device pointing to the beginning of the vector, containing nrElements floats.
thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(d_vector);
thrust::device_vector<float>::iterator d_it = thrust::max_element(d_ptr, d_ptr + nrElements);
max_index = d_it - (thrust::device_vector<float>::iterator)d_ptr;
The number of elements in the vector ranges between 10'000 and 20'000. The difference in speed between thrust::max_element and cublasIsamax is rather big. Perhaps I'm performing several memory transactions without knowing?
A more efficient implementation would be to write your own max-index reduction code in CUDA. It's likely that cublasIsamax is using something like this under the hood.
We can compare 3 approaches:
thrust::max_element
cublasIsamax
custom CUDA kernel
Here's a fully worked example:
$ cat t665.cu
#include <cublas_v2.h>
#include <thrust/extrema.h>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <iostream>
#include <stdlib.h>
#define DSIZE 10000
// nTPB should be a power-of-2
#define nTPB 256
#define MAX_KERNEL_BLOCKS 30
#define MAX_BLOCKS ((DSIZE/nTPB)+1)
#define MIN(a,b) ((a>b)?b:a)
#define FLOAT_MIN -1.0f
#include <time.h>
#include <sys/time.h>
unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
timeval tv1;
gettimeofday(&tv1,0);
return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}
__device__ volatile float blk_vals[MAX_BLOCKS];
__device__ volatile int blk_idxs[MAX_BLOCKS];
__device__ int blk_num = 0;
template <typename T>
__global__ void max_idx_kernel(const T *data, const int dsize, int *result){
__shared__ volatile T vals[nTPB];
__shared__ volatile int idxs[nTPB];
__shared__ volatile int last_block;
int idx = threadIdx.x+blockDim.x*blockIdx.x;
last_block = 0;
T my_val = FLOAT_MIN;
int my_idx = -1;
// sweep from global memory
while (idx < dsize){
if (data[idx] > my_val) {my_val = data[idx]; my_idx = idx;}
idx += blockDim.x*gridDim.x;}
// populate shared memory
vals[threadIdx.x] = my_val;
idxs[threadIdx.x] = my_idx;
__syncthreads();
// sweep in shared memory
for (int i = (nTPB>>1); i > 0; i>>=1){
if (threadIdx.x < i)
if (vals[threadIdx.x] < vals[threadIdx.x + i]) {vals[threadIdx.x] = vals[threadIdx.x+i]; idxs[threadIdx.x] = idxs[threadIdx.x+i]; }
__syncthreads();}
// perform block-level reduction
if (!threadIdx.x){
blk_vals[blockIdx.x] = vals[0];
blk_idxs[blockIdx.x] = idxs[0];
if (atomicAdd(&blk_num, 1) == gridDim.x - 1) // then I am the last block
last_block = 1;}
__syncthreads();
if (last_block){
idx = threadIdx.x;
my_val = FLOAT_MIN;
my_idx = -1;
while (idx < gridDim.x){
if (blk_vals[idx] > my_val) {my_val = blk_vals[idx]; my_idx = blk_idxs[idx]; }
idx += blockDim.x;}
// populate shared memory
vals[threadIdx.x] = my_val;
idxs[threadIdx.x] = my_idx;
__syncthreads();
// sweep in shared memory
for (int i = (nTPB>>1); i > 0; i>>=1){
if (threadIdx.x < i)
if (vals[threadIdx.x] < vals[threadIdx.x + i]) {vals[threadIdx.x] = vals[threadIdx.x+i]; idxs[threadIdx.x] = idxs[threadIdx.x+i]; }
__syncthreads();}
if (!threadIdx.x)
*result = idxs[0];
}
}
int main(){
int nrElements = DSIZE;
float *d_vector, *h_vector;
h_vector = new float[DSIZE];
for (int i = 0; i < DSIZE; i++) h_vector[i] = rand()/(float)RAND_MAX;
h_vector[10] = 10; // create definite max element
cublasHandle_t my_handle;
cublasStatus_t my_status = cublasCreate(&my_handle);
cudaMalloc(&d_vector, DSIZE*sizeof(float));
cudaMemcpy(d_vector, h_vector, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
int max_index = 0;
unsigned long long dtime = dtime_usec(0);
//d_vector is a pointer on the device pointing to the beginning of the vector, containing nrElements floats.
thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(d_vector);
thrust::device_vector<float>::iterator d_it = thrust::max_element(d_ptr, d_ptr + nrElements);
max_index = d_it - (thrust::device_vector<float>::iterator)d_ptr;
cudaDeviceSynchronize();
dtime = dtime_usec(dtime);
std::cout << "thrust time: " << dtime/(float)USECPSEC << " max index: " << max_index << std::endl;
max_index = 0;
dtime = dtime_usec(0);
my_status = cublasIsamax(my_handle, DSIZE, d_vector, 1, &max_index);
cudaDeviceSynchronize();
dtime = dtime_usec(dtime);
std::cout << "cublas time: " << dtime/(float)USECPSEC << " max index: " << max_index << std::endl;
max_index = 0;
int *d_max_index;
cudaMalloc(&d_max_index, sizeof(int));
dtime = dtime_usec(0);
max_idx_kernel<<<MIN(MAX_KERNEL_BLOCKS, ((DSIZE+nTPB-1)/nTPB)), nTPB>>>(d_vector, DSIZE, d_max_index);
cudaMemcpy(&max_index, d_max_index, sizeof(int), cudaMemcpyDeviceToHost);
dtime = dtime_usec(dtime);
std::cout << "kernel time: " << dtime/(float)USECPSEC << " max index: " << max_index << std::endl;
return 0;
}
$ nvcc -O3 -arch=sm_20 -o t665 t665.cu -lcublas
$ ./t665
thrust time: 0.00075 max index: 10
cublas time: 6.3e-05 max index: 11
kernel time: 2.5e-05 max index: 10
$
Notes:
CUBLAS returns an index 1 higher than the others because CUBLAS uses 1-based indexing.
CUBLAS might be quicker if you used CUBLAS_POINTER_MODE_DEVICE, however for validation you would still have to copy the result back to the host.
CUBLAS with CUBLAS_POINTER_MODE_DEVICE should be asynchronous, so the cudaDeviceSynchronize() will be desirable for the host based timing I've shown here. In some cases, thrust can be asynchronous as well.
For convenience and results comparison between CUBLAS and the other methods, I am using all nonnegative values for my data. You may want to adjust the FLOAT_MIN value if you are using negative values as well.
If you're freaky about performance, you can try tuning the nTPB and MAX_KERNEL_BLOCKS parameters to see if you can max out performance on your specific GPU. The kernel code also arguably leaves some performance on the table by not switching carefully into a warp-synchronous mode for the final stages of the (two) threadblock reduction(s).
The threadblock reduction kernel uses a block-draining/last-block strategy to avoid the overhead of an additional kernel launch to perform the final reduction.
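For validating any of the three approaches, a plain host-side reference can be handy. The sketch below is an assumption-laden illustration, not part of the timed code above: host_max_index and cublas_to_zero_based are hypothetical names, and the sentinel uses std::numeric_limits<float>::lowest() so that negative data is handled, which is one way of making the FLOAT_MIN adjustment mentioned in the notes.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical host reference: zero-based index of the largest element
// (the first one on ties). Returns -1 for an empty vector.
int host_max_index(const std::vector<float>& v) {
    // lowest() is the most negative finite float, so negative inputs work.
    float best = std::numeric_limits<float>::lowest();
    int best_idx = -1;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] > best) {
            best = v[i];
            best_idx = static_cast<int>(i);
        }
    }
    return best_idx;
}

// cublasIsamax reports a 1-based index; shift it before comparing.
inline int cublas_to_zero_based(int cublas_index) {
    return cublas_index - 1;
}
```

With the run shown above, cublas_to_zero_based(11) gives 10, matching the thrust and kernel results.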
I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
All arrays involved are of float type. Actually, they are not arrays but objects with an overloaded [] operator, which (I think) should be optimized away, but is defined as follows:
inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }
For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.
Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 65 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. On a 64-bits system this is 120 bytes per iteration, so 57 million * 120 bytes = 6.8 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.
To speed this up, I've attempted the following:
Compile with g++ -O3. (Well, I'd been doing this from the beginning.)
Parallelizing over 4 cores using OpenMP pragmas. I have to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.
Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?
(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")
Update: as requested, here's a full program:
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void solve(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step();
}
}
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
solve(atoi(argv[1]));
cout << d_x[0] << endl; // prevent the thing from being optimized away
}
I compile and run it as follows:
$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
0
real 0m1.052s
user 0m1.050s
sys 0m0.010s
(It does 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)
Update 2: I've been told that uninitialized values may not be representative because NaN and Inf values may slow things down. I am now clearing the memory in the example code. It makes no difference for me in execution speed, though.
Couple of ideas:
1. Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage of this is that some of the d_x values will be "stale", so your convergence rate will suffer (but not as badly as with a Jacobi iteration); it's hard to say whether the speedup offsets it.
2. Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5).
3. Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing incompressibility), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
4. Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using FFT-based methods.
If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.
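As an illustration of the SOR suggestion, here is a minimal sketch. It assumes the same five-point stencil as the question, stored row-major with index y * nx + x; the Stencil struct and sor_step are hypothetical names introduced only for this example. Each point takes the Gauss-Seidel value and is then pushed further in the same direction by the relaxation factor omega (omega = 1 gives plain Gauss-Seidel).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the solution x, right-hand side b and the
// four off-diagonal coefficient arrays, all of size nx * ny.
struct Stencil {
    std::size_t nx, ny;
    std::vector<float> x, b, w, e, s, n;
    Stencil(std::size_t nx_, std::size_t ny_)
        : nx(nx_), ny(ny_),
          x(nx_ * ny_, 0.0f), b(nx_ * ny_, 0.0f), w(nx_ * ny_, 0.0f),
          e(nx_ * ny_, 0.0f), s(nx_ * ny_, 0.0f), n(nx_ * ny_, 0.0f) {}
};

// One successive over-relaxation sweep over the interior points.
void sor_step(Stencil& g, float omega) {
    for (std::size_t y = 1; y + 1 < g.ny; ++y) {
        for (std::size_t xi = 1; xi + 1 < g.nx; ++xi) {
            const std::size_t ic = y * g.nx + xi;
            // Value a plain Gauss-Seidel update would assign.
            const float gs = g.b[ic]
                - g.w[ic] * g.x[ic - 1] - g.e[ic] * g.x[ic + 1]
                - g.s[ic] * g.x[ic - g.nx] - g.n[ic] * g.x[ic + g.nx];
            // Blend old value and Gauss-Seidel value with weight omega.
            g.x[ic] = (1.0f - omega) * g.x[ic] + omega * gs;
        }
    }
}
```

The extra cost per point is one multiply-add, so any gain comes from needing fewer sweeps; a suitable omega depends on the system and usually has to be found by experiment.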
I think I've managed to optimize it. Here's the code: create a new project in VC++, add this code, and simply compile under "Release".
#include <iostream>
#include <cstdlib>
#include <cstring>
#include <cstdio>
#define _WIN32_WINNT 0x0400
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <conio.h>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step_original() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void step_new() {
//size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
float
*d_b_ic,
*d_w_ic,
*d_e_ic,
*d_x_ic,
*d_x_iw,
*d_x_ie,
*d_x_is,
*d_x_in,
*d_n_ic,
*d_s_ic;
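// Start each pointer at the offset the index version uses
// (ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1).
d_b_ic = d_b + d_ny + 1;
d_w_ic = d_w + d_ny + 1;
d_e_ic = d_e + d_ny + 1;
d_x_ic = d_x + d_ny + 1;
d_x_iw = d_x + d_ny;
d_x_ie = d_x + d_ny + 2;
d_x_is = d_x + 1;
d_x_in = d_x + 2 * d_ny + 1;
d_n_ic = d_n + d_ny + 1;
d_s_ic = d_s + d_ny + 1;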
for (size_t y = 1; y < d_ny - 1; ++y)
{
for (size_t x = 1; x < d_nx - 1; ++x)
{
/*d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];*/
*d_x_ic = *d_b_ic
- *d_w_ic * *d_x_iw - *d_e_ic * *d_x_ie
- *d_s_ic * *d_x_is - *d_n_ic * *d_x_in;
//++ic; ++iw; ++ie; ++is; ++in;
d_b_ic++;
d_w_ic++;
d_e_ic++;
d_x_ic++;
d_x_iw++;
d_x_ie++;
d_x_is++;
d_x_in++;
d_n_ic++;
d_s_ic++;
}
//ic += 2; iw += 2; ie += 2; is += 2; in += 2;
d_b_ic += 2;
d_w_ic += 2;
d_e_ic += 2;
d_x_ic += 2;
d_x_iw += 2;
d_x_ie += 2;
d_x_is += 2;
d_x_in += 2;
d_n_ic += 2;
d_s_ic += 2;
}
}
void solve_original(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_original();
}
}
void solve_new(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_new();
}
}
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
if(argc < 3) {
printf("app.exe (x)iters (o/n)algo\n");
return 1; // argv[1] and argv[2] are read below, so stop when they are missing
}
bool bOriginalStep = (argv[2][0] == 'o');
size_t iters = atoi(argv[1]);
/*printf("Press any key to start!");
_getch();
printf(" Running speed test..\n");*/
__int64 freq, start, end, diff;
if(!::QueryPerformanceFrequency((LARGE_INTEGER*)&freq))
throw "Not supported!";
freq /= 1000000; // microseconds!
{
::QueryPerformanceCounter((LARGE_INTEGER*)&start);
if(bOriginalStep)
solve_original(iters);
else
solve_new(iters);
::QueryPerformanceCounter((LARGE_INTEGER*)&end);
diff = (end - start) / freq;
}
printf("Speed (%s)\t\t: %lld\n", (bOriginalStep ? "original" : "new"), (long long)diff); // diff is 64-bit, so %u would be wrong
//_getch();
//cout << d_x[0] << endl; // prevent the thing from being optimized away
}
Run it like this:
app.exe 10000 o
app.exe 10000 n
"o" means old code, yours.
"n" is mine, the new one.
My results:
Speed (original):
1515028
1523171
1495988
Speed (new):
966012
984110
1006045
Improvement of about 30%.
The logic behind:
You've been using index counters to access/manipulate.
I use pointers.
While running, set a breakpoint at one of the calculation lines in VC++'s debugger, and press F8. You'll get the disassembler window.
Then you'll see the produced opcodes (assembly code).
Anyway, look:
int *x = ...;
x[3] = 123;
This tells the PC to put the pointer x in a register (say EAX).
Then add (3 * sizeof(int)) to it.
Only then, set the value to 123.
The pointer approach is better, as you can see, because we cut out the addition; we handle the pointer arithmetic ourselves, and so can optimize it as needed.
I hope this helps.
Sidenote to stackoverflow.com's staff:
Great website, I wish I had heard of it long ago!
For one thing, there seems to be a pipelining issue here. The loop reads from the value in d_x that has just been written to, but apparently it has to wait for that write to complete. Just rearranging the order of the computation, doing something useful while it's waiting, makes it almost twice as fast:
d_x[ic] = d_b[ic]
- d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in]
- d_w[ic] * d_x[iw] /* d_x[iw] has just been written to, process this last */;
It was Eamon Nerbonne who figured this out. Many upvotes to him! I would never have guessed.
Poni's answer looks like the right one to me.
I just want to point out that in this type of problem, you often gain benefits from memory locality. Right now, the b,w,e,s,n arrays are all at separate locations in memory. If you could not fit the problem in L3 cache (mostly in L2), then this would be bad, and a solution of this sort would be helpful:
#include <iostream>
#include <cstdlib>
#include <cstring>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x;
struct D { float b,w,e,s,n; };
D *d;
void step() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d[ic].b
- d[ic].w * d_x[iw] - d[ic].e * d_x[ie]
- d[ic].s * d_x[is] - d[ic].n * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void solve(size_t iters) { for (size_t i = 0; i < iters; ++i) step(); }
void clear(float *a) { memset(a, 0, d_nx * d_ny * sizeof(float)); }
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d = new D[n]; memset(d,0,n * sizeof(D));
solve(atoi(argv[1]));
cout << d_x[0] << endl; // prevent the thing from being optimized away
}
For example, this solution at 1280x1280 is a little less than 2x faster than Poni's solution (13s vs 23s in my test--your original implementation is then 22s), while at 128x128 it's 30% slower (7s vs. 10s--your original is 10s).
(Iterations were scaled up to 80000 for the base case, and 800 for the 100x larger case of 1280x1280.)
I think you're right about memory being a bottleneck. It's a pretty simple loop with just some simple arithmetic per iteration. The ic, iw, ie, is, and in indices seem to be on opposite sides of the matrix, so I'm guessing that there are a bunch of cache misses there.
I'm no expert on the subject, but I've seen that there are several academic papers on improving the cache usage of the Gauss-Seidel method.
Another possible optimization is the use of the red-black variant, where points are updated in two sweeps in a chessboard-like pattern. In this way, all updates in a sweep are independent and can be parallelized.
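A rough sketch of such a red-black sweep is shown below. It assumes the same five-point stencil as the question, stored row-major with index y * nx + x; the function name red_black_step and its parameter list are hypothetical. Points where (x + y) is even are updated first, then the points where it is odd, so every update within one color reads only values of the other color.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical red-black Gauss-Seidel sweep over the interior of an
// nx-by-ny grid. All vectors hold nx * ny floats in row-major order.
void red_black_step(std::size_t nx, std::size_t ny,
                    std::vector<float>& x,
                    const std::vector<float>& b,
                    const std::vector<float>& w,
                    const std::vector<float>& e,
                    const std::vector<float>& s,
                    const std::vector<float>& n) {
    for (std::size_t color = 0; color < 2; ++color) {
        // Rows are independent within one color, so this loop over y is
        // the natural place to parallelize (for example with OpenMP).
        for (std::size_t y = 1; y + 1 < ny; ++y) {
            // First interior column in this row with (xi + y) % 2 == color.
            const std::size_t start = 1 + ((1 + y + color) % 2);
            for (std::size_t xi = start; xi + 1 < nx; xi += 2) {
                const std::size_t ic = y * nx + xi;
                x[ic] = b[ic]
                    - w[ic] * x[ic - 1] - e[ic] * x[ic + 1]
                    - s[ic] * x[ic - nx] - n[ic] * x[ic + nx];
            }
        }
    }
}
```

The update order differs from the lexicographic sweep in the question, so intermediate iterates will not match it value for value, even though both are Gauss-Seidel-type iterations on the same system.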
I suggest putting in some prefetch statements and also researching "data oriented design":
void step_original() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
float dw_ic, db_ic, de_ic, dn_ic, ds_ic; // coefficients at the centre point
float dx_iw, dx_is, dx_ie, dx_in; // neighbouring solution values
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
// Perform the prefetch
// Sorting these statements by array may increase speed;
// although sorting by index name may increase speed too.
db_ic = d_b[ic];
dw_ic = d_w[ic];
dx_iw = d_x[iw];
de_ic = d_e[ic];
dx_ie = d_x[ie];
ds_ic = d_s[ic];
dx_is = d_x[is];
dn_ic = d_n[ic];
dx_in = d_x[in];
// Calculate
d_x[ic] = db_ic
- dw_ic * dx_iw - de_ic * dx_ie
- ds_ic * dx_is - dn_ic * dx_in;
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
This differs from your second method since the values are copied to local temporary variables before the calculation is performed.