I am currently experimenting with the performance of OpenCL code using the GPU and using C++ on the CPU. I have written programs that compute the sum z = x + y, where z, x,and y are two dimensional arrays (matrices) for the GPU and the CPU. After testing these programs, I have found that the CPU is much more efficient at computing this sum than the GPU due to the slow transfer of data in the PCI bus between the GPU and the CPU. Now I want to determine how many more sums will be required in order to make using the GPU more efficient than the CPU. I plan to do this by increasing the sum z = x + y to z = x + y + y + y + y + ... and so on.
Will it be possible to make using the GPU more efficient than the CPU just by increasing the number of sums for this specific problem?
Just as an FYI: I am using a nVIDIA GeForce GT 640 graphics card and an i5 Intel core CPU.
Any help will be greatly appreciated.
EDIT:
Below I have attached my code on the CPU:
int main(int argc, const char * argv[])
{
//This value determines the size of the nxn (square array)
int n = 1000;
//Allocating the memory for the nxn arrays of floats.
float **x = (float**)malloc(sizeof(float*)*n);
float **y = (float**)malloc(sizeof(float*)*n);
float **z = (float**)malloc(sizeof(float*)*n);
//Initializing the arrays.
for(int i = 0; i<n; i++){
x[i] = (float*)malloc(sizeof(float)*n);
y[i] = (float*)malloc(sizeof(float)*n);
z[i] = (float*)malloc(sizeof(float)*n);
for(int j = 0; j<n; j++){
x[i][j] = i+j;
y[i][j] = i+j;
}
}
for(int i = 0; i<n; i++){
for(int j = 0; j<n; j++){
z[i][j] = x[i][j] + y[i][j];
for(int k = 0; k < 100; k++){
z[i][j] += y[i][j];
}
}
}
return 0;
}
And here is the C++ using OpenCL: (used to copy the data and execute the kernel on the GPU)
int n = 1000;
for(int i = 0; i<n; i++)
{
//Writing the data from the host to the device
err = clEnqueueWriteBuffer(queue, d_xx, CL_TRUE, 0, sizeof(float)*n, h_xx[i], 0, NULL, NULL);
if(err != CL_SUCCESS){
std::cout << "Error: Could not write to buffer d_xx" << std::endl;
exit(1);
}
err = clEnqueueWriteBuffer(queue, d_yy, CL_TRUE, 0, sizeof(float)*n, h_yy[i], 0, NULL, NULL);
if(err != CL_SUCCESS){
std::cout << "Error: Could not write to buffer d_yy" << std::endl;
exit(1);
}
//Setting the Kernel Arguments
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_xx);
if(err != CL_SUCCESS){
std::cout << "Error: Could not set kernel argument h_xx." << std::endl;
exit(1);
}
err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_yy);
if(err != CL_SUCCESS){
std::cout << "Error: Could not set kernel argument h_yy." << std::endl;
exit(1);
}
err = clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_zz);
if(err != CL_SUCCESS){
std::cout << "Error: Could not set kernel argument h_zz." << std::endl;
}
work_units_per_kernel = n;
//Executing the Kernel
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_units_per_kernel, NULL, 0, NULL, NULL);
if(err != CL_SUCCESS){
std::cout << "Error: Could not execute kernel." << std::endl;
exit(1);
}
//Reading the Data from the Kernel
err = clEnqueueReadBuffer(queue, d_zz, CL_TRUE, 0, n*(sizeof(float)), h_zz[i], 0, NULL, NULL);
if(err != CL_SUCCESS){
std::cout << "Error: Could not read data from kernel." << std::endl;
exit(1);
}
}
And lastly the kernel code executed on the GPU:
__kernel void arraysum(__global const float *d_aa, __global const float *d_bb, __global float *d_cc)
{
int i = get_global_id(0);
d_cc[i] = d_aa[i] + d_bb[i];
for(int j = 0; j < 100; j++){
d_cc[i] += d_bb[i];
}
}
For n = 1000*1000 you are getting to the point where copying, operating, and copying back is worth it. As DarkZero pointed out, Global Memory is NOT optimal, so if you can cache your Global Memory to Local Memory or Thread Memory and use Local Work Groups, this will help tremendously on both the CPU and GPU.
Let's start with the Kernel. d_cc is referenced 100 times from Global Memory. A simple change in this case is to cache the global memory into thread memory, and then at the end copy the local back to global.
__kernel void arraysum(__global const float *d_aa, __global const float *d_bb, __global float *d_cc)
{
int i = get_global_id(0);
float t_d_cc = d_aa[i] + d_bb[i]; //make a thread only version of d_cc
for(int j = 0; j < 100; j++){
t_d_cc += d_bb[i];
}
d_cc[i] = t_d_cc; //copy the thread only back to global
}
Another change, dependent on hardware, is to cache d_aa and d_bb into local memory. This lets OpenCL take advantage of batch copies from Global Memory. This can be a bit more challenging because each OpenCL device has different sizes and multiples of Local Workgroup sizes that can be used.
For example, my i5 has a Maximum Workgroup size of 1024 and a Workgroup multiple of 1, so my Local Workgroups can be anything from 1 to 1024. My ATI-7970 has values of 256 and 64, respectively, so my Local Worksgroups needs to be 64, 128, etc. This is much more restrictive.
__kernel void arraysum(__global const float *d_aa,
__local float *l_d_aa,
__global const float *d_bb,
__local float *l_d_bb,
__global float *d_cc,
__local float *l_d_cc)
{
//In this example, the global_id(1) is the number of rows and global_id(0) is the columns
//So when the kernel is called, the local work group size needs to be the size of the
//number of columns
int i = get_global_id(1)*get_global_size(0) + get_global_id(0); //Index of the row
int j = get_local_id(0);
l_d_aa[get_local_id(0)] = d_aa[i];
l_d_bb[get_local_id(0)] = d_bb[i];
read_mem_fence(CLK_LOCAL_MEM_FENCE);
float l_d_cc[get_local_id(0)] = l_d_aa[get_local_id(0)] + l_d_bb[get_local_id(0)];
for(int j = 0; j < get_global_size(0); j++){
l_d_cc[get_local_id(0)] += l_d_bb[j];
}
d_cc[i] = l_d_cc[get_local_id(0)]; //copy the thread only back to global
}
I apologize if I got the algorithm wrong, but hopefully it conveys how to cache global memory to local memory. Again, on the i5, the local workgroup size can be 1 to 1024, but the ATI7970 is restricted to column sizes of 64, 128, etc.
It is conceptually much more difficult, but the performance for OpenCL is much, much better when using this approach.
Community, please feel free clean up the kernel.
Many things are slowing you down:
1- Abuse of use of global memory. Each global memory access is like 400times slower, and you ONLY use global memory (like 200 read/writes). Global memory must be used only to read at the begining, and write at the end, never as an intermediate value.
2- Your N length is very short. CPU will finish in just 1000 instructions, while all the latencies in the GPU are much slower than this. Because a 100MB copy is much more efficient than a 1 byte copy, there are overheads in the copy operations.
3- Probably, CPU code is being optimized by the compiler into multiplications, while GPU code can't since it is accessing volatile variables like globals.
4- Memory read/writes to/from device are very expensive, if you include this into the calc, CPU will easily win. Also OpenCL buffer and kernels creations are very expensive. Note you are also using blocking write calls, this is much slower than non-blocking calls.
Related
I build a simple cuda kernel that performs a sum on elements. Each thread adds an input value to an output buffer. Each thread calculates one value. 2432 threads are being used (19 blocks * 128 threads).
The output buffer remains the same, the input buffer pointer is shifted by threadcount after each kernel execution. So in total, we have a loop invoking the add kernel until we computed all input data.
Example:
All my input values are set to 1. The output buffer size is 2432. The input buffer size is 2432 *2000.
2000 times the add kernel is called to add 1 to each field of output. The endresult in output is 2000 at every field. I call the function aggregate which contains a for loop, calling the kernel as often as needed to pass over the complete input data.
This works so far unless I call the kernel too often.
However if I call the Kernel 2500 times, I get an illegalmemoryaccess cuda error.
As you can see, the runtime of the last successfull kernel increases by 3 orders of magnitude. Afterwards my pointers are invalidated and the following invocations result in CudaErrorIllegalAdress.
I cleaned up the code to get a minimal working example:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <vector>
#include <stdio.h>
#include <iostream>
using namespace std;
template <class T> __global__ void addKernel_2432(int *in, int * out)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
out[i] = out[i] + in[i];
}
static int aggregate(int* array, size_t size, int* out) {
size_t const vectorCount = size / 2432;
cout << "ITERATIONS: " << vectorCount << endl;
for (size_t i = 0; i < vectorCount-1; i++)
{
addKernel_2432<int><<<19,128>>>(array, out);
array += vectorCount;
}
addKernel_2432<int> << <19, 128 >> > (array, out);
return 1;
}
int main()
{
int* dev_in1 = 0;
size_t vectorCount = 2432;
int * dev_out = 0;
size_t datacount = 2432*2500;
std::vector<int> hostvec(datacount);
//create input buffer, filled with 1
std::fill(hostvec.begin(), hostvec.end(), 1);
//allocate input buffer and output buffer
cudaMalloc(&dev_in1, datacount*sizeof(int));
cudaMalloc(&dev_out, vectorCount * sizeof(int));
//set output buffer to 0
cudaMemset(dev_out, 0, vectorCount * sizeof(int));
//copy input buffer to GPU
cudaMemcpy(dev_in1, hostvec.data(), datacount * sizeof(int), cudaMemcpyHostToDevice);
//call kernel datacount / vectorcount times
aggregate(dev_in1, datacount, dev_out);
//return data to check for corectness
cudaMemcpy(hostvec.data(), dev_out, vectorCount*sizeof(int), cudaMemcpyDeviceToHost);
if (cudaSuccess != cudaMemcpy(hostvec.data(), dev_out, vectorCount * sizeof(int), cudaMemcpyDeviceToHost))
{
cudaError err = cudaGetLastError();
cout << " CUDA ERROR: " << cudaGetErrorString(err) << endl;
}
else
{
cout << "NO CUDA ERROR" << endl;
cout << "RETURNED SUM DATA" << endl;
for (int i = 0; i < 2432; i++)
{
cout << hostvec[i] << " ";
}
}
cudaDeviceReset();
return 0;
}
If you compile and run it, you get an error.
Change:
size_t datacount = 2432 * 2500;
to
size_t datacount = 2432 * 2400;
and it gives the correct results.
I am looking for any ideas, why it breaks after 2432 kernel invocations.
What i have found so far googeling around:
Wrong target architecture set. I use a 1070ti. My target is set to: compute_61,sm_61 In visual studio project properties. That does not change anything.
Did I miss something? Is there a limit how many times a kernel can be called until cuda invalidates pointer? Thank you for your help. I used windows, Visual Studio 2019 and CUDA runtime 11.
This is the output in both cases. Succes and failure:
[
Error:
[
static int aggregate(int* array, size_t size, int* out) {
size_t const vectorCount = size / 2432;
for (size_t i = 0; i < vectorCount-1; i++)
{
array += vectorCount;
}
}
That's not vectorCount but the number of iterations you have been accidentally incrementing by. Works fine while vectorCount <= 2432 (but yields wrong results), and results in buffer overflow above.
array += 2432 is what you intended to write.
I want to implement the pinned memory feature of GPU in my code. For doing that I write my code like this:
bool addVectorGPU(float* M, float* N, float* P, int size)
{
// Error return value
cudaError_t status;
cudaSetDeviceFlags(cudaDeviceMapHost);
// Number of bytes in the matrix.
int bytes = size * sizeof(float);
// Pointers to the device arrays
float *Md, *Nd, *Pd;
// Allocate memory on the device to store each matrix
cudaHostAlloc((void**)&M, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&N, bytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&P, bytes, cudaHostAllocMapped);
// Copy the host input data to the device
cudaHostGetDevicePointer((void**)&Md, M, 0);
cudaHostGetDevicePointer((void**)&Nd, N, 0);
cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Specify the size of the grid and the size of the block
dim3 dimBlock(TILE_SIZE); // Matrix is contained in a block
dim3 dimGrid((int)ceil((float)size / (float)TILE_SIZE));
// Launch the kernel on a size-by-size block of threads
addVectorKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, size);
// Wait for completion
cudaThreadSynchronize();
cudaDeviceSynchronize();
// Check for errors
status = cudaGetLastError();
if (status != cudaSuccess) {
std::cout << "Kernel failed: " << cudaGetErrorString(status) <<
std::endl;
cudaFreeHost(M);
cudaFreeHost(N);
cudaFreeHost(P);
return false;
}
// Retrieve the result matrix
//cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Free device memory
cudaFreeHost(M);
cudaFreeHost(N);
cudaFreeHost(P);
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);
// Success
return true;
}
Now for evaluating performance on my device I call this function 1000 times and then compute the average time which it takes to run:
int main(){
// Timing data
float tcpuadd, tcpusub, tcpuscale, tgpuadd, tgpusub, tgpuscale, sum, delta, L2norm;
clock_t start, end;
bool success;
//Allocate the four vectors of SIZE floats
float* M = new float[SIZE];
float* N = new float[SIZE];
float* Pcpu = new float[SIZE];
float* Pgpu = new float[SIZE];
//Initialize M and N to random integers
for (int i = 0; i < SIZE; i ++){
M[i] = (float) rand()/(RAND_MAX);
N[i] = (float) rand()/(RAND_MAX);
}
printf("Operating on a vector of length %d\n", SIZE);
//Add two vectors and compute timing in CPU
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorCPU(M, N, Pcpu, SIZE);
}
end = clock();
tcpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf( "CPU Addition took %f ms\n", tcpuadd);
//Add two vectors and compute timing in GPU
success = addVectorGPU(M, N ,Pgpu , SIZE);
if(!success)
{
printf("Device Error!\n");
return 1;
}
//compute GPU timing
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorGPU(M, N, Pgpu, SIZE);
}
end = clock();
tgpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf("GPU Addition took %f ms\n", tgpuadd);
The problem is, for the first time this function works without any errors. But the second time when I call this function, I've got error:
cannot set when device is active in this process
So does anyone know what it is all about?
If you do a better job of cuda error checking by checking the return value of each runtime API call, you'll discover that this error is returned from the second time you call this:
cudaSetDeviceFlags(cudaDeviceMapHost);
Note that description of this runtime API call:
If the current device has been set and that device has already been initialized then this call will fail with the error cudaErrorSetOnActiveProcess.
The solution is to call the function only once, at the beginning of your application, not every time you call the addVectorGPU function. Take that call out of the addVectorGPU function, and put it in your main routine, prior to the first call of addVectorGPU.
Based on the question below, there are various other issues with the code:
I would suggest implementing proper cuda error checking on all kernel calls and all CUDA API calls, rather than once at the end of the routine.
The usage of cudaHostAlloc is incorrect. The intent of the program appears to be to pass host pointers to host-resident data to the GPU routine, and then add that data using a zero-copy technique. This is technically feasible (although it will be very slow), but the correct approach would involve the use of cudaHostRegister, not cudaHostAlloc. cudaHostAlloc creates a new allocation, so the existing data passed to the function would not be used or referenced that way.
Here's a worked example, based on what you have shown. Note that I personally would not benchmark things this way, but I am providing this to show that the process can work in an error-free way:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>
#define TILE_SIZE 512
#define SIZE 1048576
#define ITERS 10
bool addVectorCPU(float *M, float *N, float *P, int size){
for (int i=0; i< size; i++) P[i] = M[i]+N[i];
return true;
}
__global__ void addVectorKernel(float *M, float *N, float *P,int size){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < size)
P[idx] = M[idx]+N[idx];
}
bool addVectorGPU(float* M, float* N, float* P, int size)
{
// Error return value
cudaError_t status;
// Number of bytes in the matrix.
int bytes = size * sizeof(float);
// Pointers to the device arrays
float *Md, *Nd, *Pd;
// Allocate memory on the device to store each matrix
cudaHostRegister(M, bytes, cudaHostRegisterMapped);
cudaHostRegister(N, bytes, cudaHostRegisterMapped);
cudaHostRegister(P, bytes, cudaHostRegisterMapped);
// Copy the host input data to the device
cudaHostGetDevicePointer((void**)&Md, M, 0);
cudaHostGetDevicePointer((void**)&Nd, N, 0);
cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Specify the size of the grid and the size of the block
dim3 dimBlock(TILE_SIZE); // Matrix is contained in a block
dim3 dimGrid((int)ceil((float)size / (float)TILE_SIZE));
// Launch the kernel on a size-by-size block of threads
addVectorKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, size);
// Wait for completion
cudaDeviceSynchronize();
bool res = true;
// Check for errors
status = cudaGetLastError();
if (status != cudaSuccess) {
std::cout << "Kernel failed: " << cudaGetErrorString(status) << std::endl;
res = false;
}
// Retrieve the result matrix
//cudaHostGetDevicePointer((void**)&Pd, P, 0);
// Free device memory
cudaHostUnregister(M);
cudaHostUnregister(N);
cudaHostUnregister(P);
// Success
return res;
}
int main(){
// Timing data
float tcpuadd, tgpuadd;
clock_t start, end;
bool success;
//Allocate the four vectors of SIZE floats
float* M = new float[SIZE];
float* N = new float[SIZE];
float* Pcpu = new float[SIZE];
float* Pgpu = new float[SIZE];
//Initialize M and N to random integers
for (int i = 0; i < SIZE; i ++){
M[i] = rand()/(float)(RAND_MAX);
N[i] = rand()/(float)(RAND_MAX);
}
printf("Operating on a vector of length %d\n", SIZE);
//Add two vectors and compute timing in CPU
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorCPU(M, N, Pcpu, SIZE);
}
end = clock();
tcpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf( "CPU Addition took %f ms\n", tcpuadd);
//Add two vectors and compute timing in GPU
cudaSetDeviceFlags(cudaDeviceMapHost);
success = addVectorGPU(M, N ,Pgpu , SIZE);
if(!success)
{
printf("Device Error!\n");
return 1;
}
//compute GPU timing
start = clock();
for (int i = 0; i < ITERS; i++) {
addVectorGPU(M, N, Pgpu, SIZE);
}
end = clock();
tgpuadd = (float)(end - start) * 1000 / (float)CLOCKS_PER_SEC / ITERS;
printf("GPU Addition took %f ms\n", tgpuadd);
}
Note I've made a few other changes, as well. For example cudaThreadSynchronize() is deprecated, and it's not necessary to use both cudaThreadSynchronize() and cudaDeviceSynchronize(); they are redundant.
I wrote a small OpenCL application which calculates the product of two matrices. Now I've noticed that if the size of the matrix exceeds 8192 x 8192 there is a significant performance drop (calculation for a 16384 x 16384 is ~80 times slower) and even the serial implementation is over 5 times faster. Here is the host code:
/*Make some includes and definitions here*/
#include "stdafx.h"
#include <CL/cl.hpp>
#include <vector>
#include <iostream>
#include "util.hpp" // utility library
#define __CL_ENABLE_EXCEPTIONS
#define ROWS (16384) // ROWS of vectors a, b, and c
#define COLUMNS (16384)
/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/
#include "metrics.h"
/*Start main()*/
int main(void)
{
int A;
// Fill vectors X and Y with random float values
float* h_x = new float[ROWS*COLUMNS];
for (int i = 0; i < ROWS; ++i){
for (int j = 0; j < COLUMNS; ++j){
h_x[j + i*COLUMNS] = rand() / (float)RAND_MAX;;
}
}
float* h_y = new float[ROWS*COLUMNS];
for (int i = 0; i < ROWS; ++i){
for (int j = 0; j < COLUMNS; ++j){
h_y[j + i*COLUMNS] = rand() / (float)RAND_MAX;;
}
}
float* h_s = new float[ROWS*COLUMNS];
for (int i = 0; i < ROWS; ++i){
for (int j = 0; j < COLUMNS; ++j){
h_s[j + i*COLUMNS] = 0.0;
}
}
/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/
// Get all platforms (drivers)
std::vector<cl::Platform> all_platforms;
cl::Platform::get(&all_platforms);
if (all_platforms.size() == 0){ // Check for issues
std::cout << " No platforms found. Check OpenCL installation!\n";
exit(1);
}
cl::Platform default_platform = all_platforms[0];
std::cout << "Using platform: " << default_platform.getInfo<CL_PLATFORM_NAME>() << "\n";
// Get default device of the default platform
std::vector<cl::Device> all_devices;
default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
if (all_devices.size() == 0){ // Check for issues
std::cout << " No devices found. Check OpenCL installation!\n";
exit(1);
}
cl::Device default_device = all_devices[0];
std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>() << "\n";
// Create an OpenCL context
cl::Context context({ default_device });
cl::Program program(context, util::loadProgram("saxy_kernel.cl"), true);
if (program.build({ default_device }) != CL_SUCCESS){
std::cout << " Error building: " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device) << "\n";
getchar();
exit(1);
}
// create buffers on the device
cl::Buffer buffer_X(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
cl::Buffer buffer_Y(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
cl::Buffer buffer_S(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
cl::Buffer buffer_A(context, CL_MEM_READ_WRITE, sizeof(int));
//create queue to which we will push commands for the device.
cl::CommandQueue queue(context, default_device);
//write arrays A and B to the device
queue.enqueueWriteBuffer(buffer_X, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_x[0]);
queue.enqueueWriteBuffer(buffer_Y, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_y[0]);
queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int), &A);
StartCounter();
//run the kernel
cl::Kernel kernel_add = cl::Kernel(program, "simple_add");
kernel_add.setArg(0, buffer_X);
kernel_add.setArg(1, buffer_Y);
kernel_add.setArg(2, buffer_S);
kernel_add.setArg(3, buffer_A);
cl::NDRange global(ROWS*COLUMNS);
queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, global, cl::NullRange);
queue.finish();
std::cout << "Kernel execution time: " << GetCounter() << "ms \n";
//read result C from the device to array C
queue.enqueueReadBuffer(buffer_S, CL_TRUE, 0, sizeof(float)*ROWS*COLUMNS, &h_s[0]);
/*Print vectors
std::cout << "\nMatrix #1: \n";
for (int i = 0; i<ROWS*COLUMNS; i++){
std::cout << "" << h_x[i] << "\t ";
}
std::cout << "\n\nMatrix #2: \n";
for (int i = 0; i<ROWS*COLUMNS; i++){
std::cout << "" << h_y[i] << "\t ";
}
std::cout << "\n\nResult: \n";
for (int i = 0; i<ROWS*COLUMNS; i++){
std::cout << "" << h_s[i] << "\t ";
}*/
getchar();
return 0;
}
and here is the kernel:
__kernel void kernel simple_add(
__global float* X,
__global float* Y,
__global float* S,
__global int *A){
S[get_global_id(0)] = X[get_global_id(0)] * Y[get_global_id(0)];
}
Could you please explain me the reason? I know that I can achieve much better performance if I perform some algorithm optimizations, but I'm trying to figure out if this is the threshold of the "naive" implementation, or I'm doing something wrong (incorrect assignment of the work to groups).
EDIT: Because I was asked for in comments, the GPU I'm running the kernel is an AMD R9 270/2GB RAM. The CPU is an i7-4771 and the system has 8GB RAM.
Writing an answer about "how to do more calculations per thread" because code-formatting is non-existent in comments, and also covering a little on memory usage...
So, most OpenCL implementatins will need to run more than a couple of instructions per thread (and the right number of threads) for efficient performance. But like I said in comments, this is HIGHLY dependent on the actual architecture of the processing unit (GPU, CPU, or OpenCL-capable magical unit weaved from unicorn hair, whatever it may be) - each manufacturer of GPUs, CPUs and unicorn weavers have their own ideas of how to make a very efficient unit, and they all tend to change their mind as time flows too... ;)
To do a little more work in one thread you could simply do:
#define NUM_PER_THREAD 16
__kernel void kernel simple_add(
__global float* X,
__global float* Y,
__global float* S,
__global int *A)
{
for(i = 0; i < NUM_PER_THREAD; i++)
{
size_t index = get_global_id(0)*NUM_PER_THREAD + i;
S[index] = X[index] * Y[index];
}
}
[This will do 1 x 16 blocks. It gets a bit more fun to try to do 16 x 16 or something like that, but can be done if you know the size (width) of the matrix]
Regarding memory: GPU's that have dedicated local memory (in other words most graphics cards) will work MUCH faster if all the data fits in the graphics memory. Accessing "main" memory involves one of two approaches:
long access times for each cache-line when the GPU is reading over the PCI-express bus [or whatever infrastructure is used] - this can be 100 or 1000x slower than "local" memory. And the GPU also (most likely) has to ask the CPU if the memory content is in cache, and if so, wait further for the CPU to copy the data out to main memory...
"page in/out" where the GPU stops, sends an interrupt to the CPU,
the CPU finds some suitable lump [lump in this context is the technical term for "some amount of memory most likely around 4K or multiple thereof"] of memory to "remove" from the GPU
memory, and copies that out to main memory, then copies in the
required other lump of memory to the GPU memory - similar to when the OS is swapping memory to/from the hard-disk. And if you are unlucky, the GPU also has to do some interesting cache or TLB flushing to ensure that the correct data is being used.
Note that I still (in the last hour or so) haven't got any particular insight in how the AMD/ATI GPU's work, or how their OpenCL driver works. The above is a mixture of guessing/knowing how GPUs work in general, understanding of how OpenCL works in general, and calculating the memory needed to store the three different arrays of 16K x 16K using float.
I compared the performance of an OpenCL code, running on the CPU, which simply copies data from one 2D array into another to a pure C++ code which does the same thing. I used a single workgroup in the OpenCL code to make a fair comparison. I used Intel's OpenCL drivers and the Intel compiler. The OpenCL code is about 5 times slower than the CPU code. The compiler gives the following message for the copy loop:
loop was transformed to memset or memcpy.
Any suggestions on how to get the OpenCL code upto speed with the C++ code?
Thanks
OpenCL host code:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <cmath>
#include <ctime>
#include <CL/cl.hpp>
int main(int argc, char **argv)
{
// Create the two input vectors
const int N = 8192;
double *in = new double[N*N];
double *out = new double[N*N];
for(int i = 0; i < N; i++)
for (int j=0; j < N; j++) {
in[i*N + j] = i + j;
out[i*N + j] = 0.;
}
double time;
std::clock_t start;
int niter = 100;
cl_int cl_err;
std::vector<cl::Platform> platforms;
cl_err = cl::Platform::get(&platforms);
std::vector<cl::Device> devices;
cl_err = platforms.at(1).getDevices(CL_DEVICE_TYPE_CPU,
&devices);
cl_context_properties context_properties[3] = {CL_CONTEXT_PLATFORM,
(cl_context_properties)(platforms.at(1)()),
0};
cl::Context context = cl::Context(devices,
context_properties,
NULL, NULL, &cl_err);
cl::Buffer buffer_in = cl::Buffer(context,
CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
N*N*sizeof(double),
in, &cl_err);
cl::Buffer buffer_out = cl::Buffer(context,
CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
N*N*sizeof(double),
out, &cl_err);
cl::CommandQueue queue = cl::CommandQueue(context, devices.at(0), 0, &cl_err);
std::ifstream sourceFile("vector_copy.cl");
std::string sourceCode((std::istreambuf_iterator<char>(sourceFile)),
std::istreambuf_iterator<char>());
cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(),
sourceCode.length()+1));
cl::Program program(context, source, &cl_err);
cl_err = program.build(devices, NULL, NULL, NULL);
cl::Kernel kernel(program, "vector_copy", &cl_err);
cl_err = kernel.setArg(0, buffer_in);
cl_err = kernel.setArg(1, buffer_out);
cl_err = kernel.setArg(2, N);
cl::NDRange global(N);
cl::NDRange local(N);
start = std::clock();
for (int n=0; n < niter; n++) {
cl_err = queue.enqueueNDRangeKernel(kernel,
cl::NullRange,
global,
local,
NULL, NULL);
cl_err = queue.finish();
}
time = (std::clock() - start)/(double)CLOCKS_PER_SEC;
std::cout << "Time/iteration OpenCL (s) = " << time/(double)niter << std::endl;
return(0);
}
OpenCL kernel code:
__kernel void vector_copy(__global const double* restrict in,
__global double* restrict out,
const int N)
{
int i = get_global_id(0);
int j;
for (j=0; j<N; j++) {
out[j + N*i] = in[j + N*i];
}
}
C++ code:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <cmath>
#include <ctime>
const int N = 8192;
int main(int argc, char **argv)
{
double *in = new double[N*N];
double *out = new double[N*N];
// Create the two input vectors
for(int i = 0; i < N; i++)
for (int j=0; j < N; j++) {
in[j + N*i] = i + j;
out[j + N*i] = 0.;
}
std::clock_t start;
int niter = 100;
start = std::clock();
for (int n=0; n < niter; n++) {
for (int i=0; i<N; i++)
for (int j=0; j<N; j++) {
out[j + N*i] = in[j + N*i];
}
}
double time = (std::clock() - start)/(double)CLOCKS_PER_SEC;
std::cout << "Time/iteration C = " << time/(double)niter << std::endl;
return(0);
}
Intel OpenCL compiler is able to vectorize across workgroups. Basically a single function runs, as an example, 8 threads at the same time in different SSE registers.
Your particular kernel does not do that. But it doesn't really matter. I tested your program using Visual Studio 2010 and the latest Intel OpenCL for applications. I was forced to reduce N from 8192 to 4096 because the integrated GPU I have reduces the maximum OpenCL buffer size into 128MB even if just the CPU is used.
My results: Your OpenCL kernel gave me around 6956MB/s of bandwidth. A trivially changed kernel (This is called with N*N as the global size and NULL as the local size because if we don't care about local memory at all then for CPU's we should leave it undefined).
__kernel void vector_copy2(__global const double* restrict in,
__global double* restrict out)
{
int i = get_global_id(0);
out[i] = in[i];
}
Gave about the same result (7006MB/s). This kernel was actually vectorized across threads, as can be verified using the Intel OpenCL kernel compiler. It produces one kernel for a some multiple (like 4) and one kernel for a single thread. Then it just runs the vectorized kernel until it has to run the single thread kernel for the last few workitems.
The C++ code gave 6494MB/s. So it's quite in line. I don't think it would be even possible for the ICC to make it 5x faster.
I noticed in your code you had platforms.at(1), what was at platform 0 in your computer?
Remember that if you don't care about local memory at all (you don't call get_local_id in your kernels) you should treat the local size for enqueueNDRange as a simple magic parameter. Either leave it as NULL or try to find a value that produces the fastest results.
The OpenCL code, even if optimized, it will still perform the copy 1by1 (work-item by work-item). Because the OpenCL compiler is only allowed to optimize in a per work item basis. While the C++ case will be optimized by the compiler into a memcpy() call probably (as the compiler is telling you).
If you disable the compiler optimizations it will perform much faster in the GPU.
BTW is there a reason for this? You have memcpy() in C++ and clEnqueueCopyBuffer() in OpenCL for this purpose. I think that latter one is what you should use.
I'm new to CUDA, and I been trying to figure out what I'm doing wrong here. CUDA is taking longer than just using the CPU to multiply a matrix. If I'm doing something wrong please let me know.
Here is my code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <cstdlib>
#include <assert.h>
#include <time.h>
#define size 100 // Matrix size
#define cols size // Matrix width
#define rows size // Matrix height
void checkCUDAError(const char *msg)
{
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
{
fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
exit(EXIT_FAILURE);
}
}
__global__ void matrixMul( int *A, int *B, int *C)
{
int bx = blockIdx.x; // Block index
int tx = threadIdx.x; // Thread index
int ts = blockDim.x; // number of threads
// Declaration of the shared memory C element
extern __shared__ int c_element_sum[];
c_element_sum[tx] = A[tx+((bx/ts)*ts)] * B[(bx%ts)+(tx*ts)];
//Block until all threads in the block have written their data to shared mem
__syncthreads();
int sum;
for(int i=0; i<ts; i++){
if(i==0){
sum=c_element_sum[i];
}
else{
sum+=c_element_sum[i];
}
}
C[bx] = sum;
}
/////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////
int main(int argc, char** argv)
{
//create timer.
clock_t t1, t2;
//start timer
t1=clock();
//allocate host memory for matrices
unsigned int size_A = cols * rows;
unsigned int mem_size_A = sizeof(int) * size_A;
int* mA = (int*) malloc(mem_size_A);
unsigned int size_B = cols * rows;
unsigned int mem_size_B = sizeof(int) * size_B;
int* mB = (int*) malloc(mem_size_B);
unsigned int size_C = cols * rows;
unsigned int mem_size_C = sizeof(int) * size_C;
int* mC = (int*) malloc(mem_size_C);
//initialize host memory
for (int i = 0; i < size_A; ++i){
mA[i] = 1;
mB[i] = 1;
mC[i] = 0;
}
// allocate device memory
int* d_mA;
int* d_mB;
int* d_mC;
cudaMalloc((void**) &d_mA, mem_size_A);
cudaMalloc((void**) &d_mB, mem_size_B);
cudaMalloc((void**) &d_mC, mem_size_C);
//copy host memory to device (A and B)
cudaMemcpy(d_mA, mA, mem_size_A, cudaMemcpyHostToDevice);
cudaMemcpy(d_mB, mB, mem_size_B, cudaMemcpyHostToDevice);
cudaMemcpy(d_mC, mC, mem_size_C, cudaMemcpyHostToDevice);
// setup execution parameters
int numThreadsPerBlock = cols;
int numBlocks = (cols * rows);
int sharedMemSize = numThreadsPerBlock * sizeof(int);
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
// execute the kernel
matrixMul <<< dimGrid, dimBlock, sharedMemSize >>>(d_mA, d_mB, d_mC);
//Block until device has completed
cudaThreadSynchronize();
// check if kernel execution generated an error
// Check for any CUDA errors
checkCUDAError("kernel invocation");
//copy result from device to host
cudaMemcpy(mC, d_mC, mem_size_C, cudaMemcpyDeviceToHost);
// Check for any CUDA errors
checkCUDAError("memcpy");
//stop timer
t2 = clock();
//check results
for (int i = 0; i < size_C; ++i){
assert(mC[i] == cols);
}
//clean up memory
free(mA);
free(mB);
free(mC);
cudaFree(d_mA);
cudaFree(d_mB);
cudaFree(d_mC);
printf("WITH CUDA - clocks: %d \n\n", t2-t1);
//////////////////////////////
///////// CPU ONLY //////////
/////////////////////////////
//create timer.
clock_t cpu_t1, cpu_t2;
//start timer
cpu_t1=clock();
//allocate host memory for matrices
unsigned int cpu_size_A = cols * rows;
unsigned int cpu_mem_size_A = sizeof(int) * cpu_size_A;
int* cpu_mA = (int*) malloc(cpu_mem_size_A);
unsigned int cpu_size_B = cols * rows;
unsigned int cpu_mem_size_B = sizeof(int) * cpu_size_B;
int* cpu_mB = (int*) malloc(cpu_mem_size_B);
unsigned int cpu_size_C = cols * rows;
unsigned int cpu_mem_size_C = sizeof(int) * cpu_size_C;
int* cpu_mC = (int*) malloc(cpu_mem_size_C);
//initialize host memory
for (int i = 0; i < cpu_size_A; ++i){
cpu_mA[i] = 1;
cpu_mB[i] = 1;
cpu_mC[i] = 0;
}
int ts = cols;
for(int bx=0; bx<(cols*rows);bx++){
int sum = 0;
for(int tx=0; tx<cols; tx++){
sum += cpu_mA[tx+((bx/ts)*ts)] * cpu_mB[(bx%ts)+(tx*ts)];
}
cpu_mC[bx]=sum;
}
//stop timer
cpu_t2 = clock();
//check results
for (int i = 0; i < cpu_size_C; ++i){
assert(cpu_mC[i] == cols);
}
//clean up memory
free(cpu_mA);
free(cpu_mB);
free(cpu_mC);
printf("CPU ONLY - clocks: %d \n\n", cpu_t2-cpu_t1);
return 0;
}
Based on your program, this is expected. Your timer looks like it clocks the entire execution of the program, which would include copying to the device, computation time, and copying the results back. Given the rather small workload you've provided for the program (100x100 matrices), the overhead of the memory copies far outweighs any computational benefit you get when doing the computation with the kernel. Your kernel itself is also not the most efficient implementation.
I don't think you're doing anything wrong, it's just that you haven't provided a large enough chunk of work for the GPU and you could potentially further optimize your kernel. Note that simply scaling up the size of the chunk may not significantly improve the performance with respect to the CPU, since you would also be scaling up the memory management time. While it is relatively simple to write a first implementation of a program on CUDA, it is significantly more difficult to get good performance out of it. The most effective way to use CUDA is to have a high ratio of compute to memory transactions. For example, having a pipeline of several compute-intensive kernels to operate successively on a chunk of data, only needing host-device copying at the beginning and end.
If this is just a program to help you learn to code for CUDA, this is a great step and getting a deep understanding of how to optimize matrix multiplication kernels will serve you well in many other cases. If you are writing this kernel for use in a production piece of software, I would recommend you use the highly-optimized linear algebra library CUBLAS: http://developer.nvidia.com/cublas (or some other library where the hard work has been done for you already).