I wrote a small OpenCL application which calculates the product of two matrices. Now I've noticed that if the size of the matrices exceeds 8192 x 8192 there is a significant performance drop (the calculation for 16384 x 16384 is ~80 times slower), and even the serial implementation is over 5 times faster. Here is the host code:
/*Make some includes and definitions here*/
#include "stdafx.h"
#include <CL/cl.hpp>
#include <vector>
#include <iostream>
#include "util.hpp" // utility library
#define __CL_ENABLE_EXCEPTIONS
#define ROWS (16384) // ROWS of vectors a, b, and c
#define COLUMNS (16384)
/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/
#include "metrics.h"
/*Start main()*/
int main(void)
{
int A;
// Fill vectors X and Y with random float values
float* h_x = new float[ROWS*COLUMNS];
for (int i = 0; i < ROWS; ++i){
for (int j = 0; j < COLUMNS; ++j){
h_x[j + i*COLUMNS] = rand() / (float)RAND_MAX;
}
}
float* h_y = new float[ROWS*COLUMNS];
for (int i = 0; i < ROWS; ++i){
for (int j = 0; j < COLUMNS; ++j){
h_y[j + i*COLUMNS] = rand() / (float)RAND_MAX;
}
}
float* h_s = new float[ROWS*COLUMNS];
for (int i = 0; i < ROWS; ++i){
for (int j = 0; j < COLUMNS; ++j){
h_s[j + i*COLUMNS] = 0.0;
}
}
/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/
// Get all platforms (drivers)
std::vector<cl::Platform> all_platforms;
cl::Platform::get(&all_platforms);
if (all_platforms.size() == 0){ // Check for issues
std::cout << " No platforms found. Check OpenCL installation!\n";
exit(1);
}
cl::Platform default_platform = all_platforms[0];
std::cout << "Using platform: " << default_platform.getInfo<CL_PLATFORM_NAME>() << "\n";
// Get default device of the default platform
std::vector<cl::Device> all_devices;
default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
if (all_devices.size() == 0){ // Check for issues
std::cout << " No devices found. Check OpenCL installation!\n";
exit(1);
}
cl::Device default_device = all_devices[0];
std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>() << "\n";
// Create an OpenCL context
cl::Context context({ default_device });
cl::Program program(context, util::loadProgram("saxy_kernel.cl"), true);
if (program.build({ default_device }) != CL_SUCCESS){
std::cout << " Error building: " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device) << "\n";
getchar();
exit(1);
}
// create buffers on the device
cl::Buffer buffer_X(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
cl::Buffer buffer_Y(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
cl::Buffer buffer_S(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
cl::Buffer buffer_A(context, CL_MEM_READ_WRITE, sizeof(int));
//create queue to which we will push commands for the device.
cl::CommandQueue queue(context, default_device);
//write arrays A and B to the device
queue.enqueueWriteBuffer(buffer_X, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_x[0]);
queue.enqueueWriteBuffer(buffer_Y, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_y[0]);
queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int), &A);
StartCounter();
//run the kernel
cl::Kernel kernel_add = cl::Kernel(program, "simple_add");
kernel_add.setArg(0, buffer_X);
kernel_add.setArg(1, buffer_Y);
kernel_add.setArg(2, buffer_S);
kernel_add.setArg(3, buffer_A);
cl::NDRange global(ROWS*COLUMNS);
queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, global, cl::NullRange);
queue.finish();
std::cout << "Kernel execution time: " << GetCounter() << "ms \n";
//read result C from the device to array C
queue.enqueueReadBuffer(buffer_S, CL_TRUE, 0, sizeof(float)*ROWS*COLUMNS, &h_s[0]);
/*Print vectors
std::cout << "\nMatrix #1: \n";
for (int i = 0; i<ROWS*COLUMNS; i++){
std::cout << "" << h_x[i] << "\t ";
}
std::cout << "\n\nMatrix #2: \n";
for (int i = 0; i<ROWS*COLUMNS; i++){
std::cout << "" << h_y[i] << "\t ";
}
std::cout << "\n\nResult: \n";
for (int i = 0; i<ROWS*COLUMNS; i++){
std::cout << "" << h_s[i] << "\t ";
}*/
getchar();
return 0;
}
and here is the kernel:
__kernel void simple_add(
        __global float* X,
        __global float* Y,
        __global float* S,
        __global int *A){
    S[get_global_id(0)] = X[get_global_id(0)] * Y[get_global_id(0)];
}
Could you please explain the reason to me? I know that I can achieve much better performance with some algorithmic optimizations, but I'm trying to figure out whether this is the threshold of the "naive" implementation, or whether I'm doing something wrong (e.g. incorrect assignment of the work to groups).
EDIT: Because I was asked in the comments: the GPU I'm running the kernel on is an AMD R9 270 with 2 GB of RAM. The CPU is an i7-4771 and the system has 8 GB of RAM.
Writing an answer about "how to do more calculations per thread" because code formatting is non-existent in comments, and also to cover a little about memory usage...
So, most OpenCL implementations will need to run more than a couple of instructions per thread (and the right number of threads) for efficient performance. But like I said in comments, this is HIGHLY dependent on the actual architecture of the processing unit (GPU, CPU, or OpenCL-capable magical unit weaved from unicorn hair, whatever it may be) - each manufacturer of GPUs, CPUs and unicorn weavers has their own ideas of how to make a very efficient unit, and they all tend to change their minds as time flows too... ;)
To do a little more work in one thread you could simply do:
#define NUM_PER_THREAD 16
__kernel void simple_add(
        __global float* X,
        __global float* Y,
        __global float* S,
        __global int *A)
{
    for(int i = 0; i < NUM_PER_THREAD; i++)
    {
        size_t index = get_global_id(0)*NUM_PER_THREAD + i;
        S[index] = X[index] * Y[index];
    }
}
[This will do 1 x 16 blocks. It gets a bit more fun to try to do 16 x 16 or something like that, but can be done if you know the size (width) of the matrix]
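On the host side this also means launching proportionally fewer work-items; a minimal sketch, assuming the kernel above and the variable names from the question's host code:

// Each work-item now handles NUM_PER_THREAD elements, so shrink the global size
// to match (assumes ROWS*COLUMNS is divisible by NUM_PER_THREAD).
const size_t num_per_thread = 16;   // must match NUM_PER_THREAD in the kernel
cl::NDRange global((ROWS * COLUMNS) / num_per_thread);
queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, global, cl::NullRange);
queue.finish();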
Regarding memory: GPUs that have dedicated local memory (in other words, most graphics cards) will work MUCH faster if all the data fits in the graphics memory. Accessing "main" memory involves one of two approaches:
- Long access times for each cache-line when the GPU is reading over the PCI-express bus [or whatever infrastructure is used] - this can be 100 or 1000x slower than "local" memory. And the GPU also (most likely) has to ask the CPU if the memory content is in cache, and if so, wait further for the CPU to copy the data out to main memory...
- "Page in/out", where the GPU stops, sends an interrupt to the CPU, the CPU finds some suitable lump [lump in this context is the technical term for "some amount of memory, most likely around 4K or a multiple thereof"] of memory to "remove" from the GPU memory and copies it out to main memory, then copies the required other lump of memory into the GPU memory - similar to when the OS is swapping memory to/from the hard disk. And if you are unlucky, the GPU also has to do some interesting cache or TLB flushing to ensure that the correct data is being used.
Note that I still (in the last hour or so) haven't gained any particular insight into how AMD/ATI GPUs work or how their OpenCL driver works. The above is a mixture of guessing/knowing how GPUs work in general, an understanding of how OpenCL works in general, and calculating the memory needed to store the three different 16K x 16K arrays of float.
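For reference, here is the arithmetic that last sentence alludes to, written out as a quick check (the comparison against the card's 2 GB is my own inference, not part of the original answer):

// Back-of-the-envelope check of the buffer sizes involved: each array is
// ROWS*COLUMNS floats, and the host code allocates three of them.
#include <cstddef>
#include <cstdio>
int main() {
    const std::size_t bytes16k = 3ull * 16384 * 16384 * sizeof(float); // 3072 MiB total
    const std::size_t bytes8k  = 3ull * 8192  * 8192  * sizeof(float); //  768 MiB total
    std::printf("16384^2: %zu MiB, 8192^2: %zu MiB\n", bytes16k >> 20, bytes8k >> 20);
    // 3 GiB does not fit in the R9 270's 2 GB, while 768 MiB does, which matches
    // the performance cliff between 8192^2 and 16384^2 described in the question.
    return 0;
}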
Related
I built a simple CUDA kernel that performs a sum on elements. Each thread adds an input value to an output buffer. Each thread calculates one value. 2432 threads are being used (19 blocks * 128 threads).
The output buffer remains the same; the input buffer pointer is shifted by the thread count after each kernel execution. So in total, we have a loop invoking the add kernel until we have computed all input data.
Example:
All my input values are set to 1. The output buffer size is 2432. The input buffer size is 2432 * 2000.
The add kernel is called 2000 times to add 1 to each field of the output. The end result in the output is 2000 at every field. I call the function aggregate, which contains a for loop calling the kernel as often as needed to pass over the complete input data.
This works so far, unless I call the kernel too often.
However, if I call the kernel 2500 times, I get an illegal memory access CUDA error.
As you can see, the runtime of the last successful kernel increases by 3 orders of magnitude. Afterwards my pointers are invalidated and the following invocations result in cudaErrorIllegalAddress.
I cleaned up the code to get a minimal working example:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <vector>
#include <stdio.h>
#include <iostream>
using namespace std;
template <class T> __global__ void addKernel_2432(int *in, int * out)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
out[i] = out[i] + in[i];
}
static int aggregate(int* array, size_t size, int* out) {
size_t const vectorCount = size / 2432;
cout << "ITERATIONS: " << vectorCount << endl;
for (size_t i = 0; i < vectorCount-1; i++)
{
addKernel_2432<int><<<19,128>>>(array, out);
array += vectorCount;
}
addKernel_2432<int><<<19, 128>>>(array, out);
return 1;
}
int main()
{
int* dev_in1 = 0;
size_t vectorCount = 2432;
int * dev_out = 0;
size_t datacount = 2432*2500;
std::vector<int> hostvec(datacount);
//create input buffer, filled with 1
std::fill(hostvec.begin(), hostvec.end(), 1);
//allocate input buffer and output buffer
cudaMalloc(&dev_in1, datacount*sizeof(int));
cudaMalloc(&dev_out, vectorCount * sizeof(int));
//set output buffer to 0
cudaMemset(dev_out, 0, vectorCount * sizeof(int));
//copy input buffer to GPU
cudaMemcpy(dev_in1, hostvec.data(), datacount * sizeof(int), cudaMemcpyHostToDevice);
//call kernel datacount / vectorcount times
aggregate(dev_in1, datacount, dev_out);
//return data to check for correctness
cudaMemcpy(hostvec.data(), dev_out, vectorCount*sizeof(int), cudaMemcpyDeviceToHost);
if (cudaSuccess != cudaMemcpy(hostvec.data(), dev_out, vectorCount * sizeof(int), cudaMemcpyDeviceToHost))
{
cudaError err = cudaGetLastError();
cout << " CUDA ERROR: " << cudaGetErrorString(err) << endl;
}
else
{
cout << "NO CUDA ERROR" << endl;
cout << "RETURNED SUM DATA" << endl;
for (int i = 0; i < 2432; i++)
{
cout << hostvec[i] << " ";
}
}
cudaDeviceReset();
return 0;
}
If you compile and run it, you get an error.
Change:
size_t datacount = 2432 * 2500;
to
size_t datacount = 2432 * 2400;
and it gives the correct results.
I am looking for any ideas as to why it breaks after 2432 kernel invocations.
What I have found so far googling around:
Wrong target architecture set? I use a 1070 Ti. My target is set to compute_61,sm_61 in the Visual Studio project properties. That does not change anything.
Did I miss something? Is there a limit to how many times a kernel can be called before CUDA invalidates pointers? Thank you for your help. I am using Windows, Visual Studio 2019 and the CUDA 11 runtime.
This is the output in both cases, success and failure:
[screenshot of the successful run's console output omitted]
Error:
[screenshot of the failing run's console output omitted]
static int aggregate(int* array, size_t size, int* out) {
size_t const vectorCount = size / 2432;
for (size_t i = 0; i < vectorCount-1; i++)
{
array += vectorCount;
}
}
That's not the vector length but the number of iterations, and it is what you have been accidentally incrementing array by. It happens to work (while still yielding wrong results) as long as vectorCount <= 2432, and results in a buffer overflow above that.
array += 2432 is what you intended to write.
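Put differently, the fixed loop would look something like this (a sketch based on the question's code, not tested):

// Advance the input pointer by the vector length (2432 ints), not by the
// number of iterations (sketch based on the question's aggregate()).
static int aggregate(int* array, size_t size, int* out) {
    size_t const vectorLength = 2432;
    size_t const iterations = size / vectorLength;
    for (size_t i = 0; i < iterations; i++) {
        addKernel_2432<int><<<19, 128>>>(array, out);
        array += vectorLength;  // step by one vector of 2432 ints
    }
    return 1;
}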
I am trying to learn both how memory usage works in detail, and how to measure it using C++. I know that under Windows, a quick way to retrieve the amount of RAM being used by the current application process, when including <Windows.h>, is:
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo( GetCurrentProcess( ), &info, sizeof(info) );
(uint64_t)info.WorkingSetSize;
Then, I used that to run a very simple test:
#include <iostream>
#include <Windows.h>
int main(void)
{
uint64_t currentUsedRAM(0);
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
currentUsedRAM = info.WorkingSetSize;
const int N(1000000);
int x[N]; // in the second run, comment this line out
int y[N]; // in the second run, comment this line out
//int *x = new int[N]; // in the second run, uncomment this line
//int *y = new int[N]; // in the second run, uncomment this line
for (int i = 0; i < N; i++)
{
x[i] = 1;
y[i] = 2;
}
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
currentUsedRAM = info.WorkingSetSize - currentUsedRAM;
std::cout << "Current RAM used: " << currentUsedRAM << "\n";
return 0;
}
What I don't understand at all is that when I run the code above, the output is: Current RAM used: 0, while I was expecting something around 8 MB since I filled two 1D int arrays of 1 million entries each. Now, if I re-run the code but make x and y dynamically allocated arrays, the output is, as expected: Current RAM used: 8007680.
Why is that? How can I make it detect the memory usage in both cases?
The compiler has optimised your code. In fact, for your first run, neither x nor y is allocated. Considering that there is a visible side effect, the return value of GetProcessMemoryInfo, this optimisation seems kind of weird.
Anyway, you can prevent this by adding some other side effect, such as outputting the sum of the elements of those two arrays, which will guarantee the crash (roughly 8 MB of automatic arrays will not fit on the default stack).
Memory allocation for local objects with automatic storage duration happens at the beginning of the enclosing code block, and they are deallocated at the end. So your code can't measure the memory usage of any automatic storage duration variable in main (nor could my deleted code snippet, which I wasn't aware of). But things are different for objects with dynamic storage duration; they are allocated per request.
I designed a test which involves recursion for the discussion in the comment area. You can see that the memory usage increases as the program goes deeper. This shows that it does count the memory usage on the stack. BTW, it isn't counting how much memory your objects need, but how much your program needs.
void foo(int depth, int *a, int *b, uint64_t usage) {
if (depth >= 100)
return ;
int x[100], y[100];
for (int i = 0; i < 100; i++)
{
x[i] = 1 + (a==nullptr?0:a[i]);
y[i] = 2 + (b==nullptr?0:b[i]);
}
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
std::cout << "Current RAM used: " << info.WorkingSetSize - usage << "\n";
foo(depth+1,x,y,usage);
int sum = 0;
for (int i=0; i<100; i++)
sum += x[i] + y[i];
std::cout << sum << std::endl;
}
int main(void)
{
uint64_t currentUsedRAM(0);
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
currentUsedRAM = info.WorkingSetSize;
foo(0, nullptr, nullptr, currentUsedRAM);
return 0;
}
/*
Current RAM used: 0
Current RAM used: 61440
Current RAM used: 65536
Current RAM used: 65536
Current RAM used: 65536
Current RAM used: 65536
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 73728
*/
The system allocates 4k each time, which is the size of a page. I don't know why it starts at 0 and then suddenly jumps to 61440. Explaining how Windows manages memory is very hard and far beyond my ability, though I am confident about the 4k thing... and that it does count the memory usage of variables with automatic storage duration.
I am just beginning to play with CUDA, so I tried out a textbook vector addition code. However, when I set up the kernel call to only add the first half of the vector, the second half also gets added! This behavior stops when I include some thrust library header.
I am totally confused. Please see the code below:
#include <iostream>
using namespace std;
__global__ void VecAdd(float *d_dataA, float *d_dataB, float *d_resultC)
{
//printf("gridDim.x is %d \n",gridDim.x);
int tid = blockIdx.x * blockDim.x + threadIdx.x;
// printf("tid is %d \n",tid);
d_resultC[tid] = d_dataA[tid] + d_dataB[tid];
}
int main()
{
const int ARRAY_SIZE = 8*1024;
const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);
float *h_dataA, *h_dataB, *h_resultC;
float *d_dataA, *d_dataB, *d_resultC;
h_dataA = (float *)malloc(ARRAY_BYTES);
h_dataB = (float *)malloc(ARRAY_BYTES);
h_resultC = (float *)malloc(ARRAY_BYTES);
for(int i=0; i<ARRAY_SIZE;i++){
h_dataA[i]=i+1;
h_dataB[i]=2*(i+1);
};
cudaMalloc((void **)&d_dataA,ARRAY_BYTES);
cudaMalloc((void **)&d_dataB,ARRAY_BYTES);
cudaMalloc((void **)&d_resultC,ARRAY_BYTES);
cudaMemcpy(d_dataA, h_dataA,ARRAY_BYTES, cudaMemcpyHostToDevice);
cudaMemcpy(d_dataB, h_dataB,ARRAY_BYTES, cudaMemcpyHostToDevice);
cout << h_resultC[0] << endl;
cout << h_resultC[ARRAY_SIZE-1] << endl;
dim3 dimBlock(ARRAY_SIZE/8,1,1);
dim3 dimGrid(1,1,1);
VecAdd<<<dimGrid,dimBlock>>>(d_dataA, d_dataB, d_resultC);
cout << h_resultC[0] << endl;
cout << h_resultC[ARRAY_SIZE-1] << endl;
cudaMemcpy(h_resultC,d_resultC ,ARRAY_BYTES,cudaMemcpyDeviceToHost);
cout << h_resultC[0] << endl;
cout << h_resultC[ARRAY_SIZE-1] << endl;
return 0;
}
Have you launched it first with ARRAY_SIZE threads and then with half of them (or 1/8)?
You are not initializing d_resultC, so it probably still holds the results of previous executions. That would explain the behavior, but maybe it doesn't.
Add a cudaMemset over d_resultC and tell us what happens.
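For example (a sketch using the names from the question; ARRAY_BYTES is the full buffer size):

// Zero the result buffer before the launch so stale device data from a
// previous run cannot show up in the "unprocessed" half.
cudaMemset(d_resultC, 0, ARRAY_BYTES);
VecAdd<<<dimGrid, dimBlock>>>(d_dataA, d_dataB, d_resultC);
cudaDeviceSynchronize();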
I can't answer for sure why your kernel is processing more elements than expected. It processes one element per thread, so the number of elements processed definitely should be blockDim.x*gridDim.x.
I want to point out, though, that it's good practice to write kernels that use "grid stride loops" so they aren't so dependent on the block and thread count. The performance cost is negligible, and if you are performance-sensitive, the optimal blocking parameters are different for different GPUs anyway.
http://cudahandbook.to/15QbFWx
So you should add a count parameter (the number of elements to process), then write something like:
__global__ void VecAdd(float *d_dataA, float *d_dataB, float *d_resultC, int N)
{
for ( int i = blockIdx.x*blockDim.x + threadIdx.x;
i < N;
i += blockDim.x*gridDim.x ) {
d_resultC[i] = d_dataA[i] + d_dataB[i];
}
}
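A possible launch for that version, processing only the first half of the array, might look like this (illustrative values, not from the original answer):

// Grid-stride launch: the element count N, not the launch configuration,
// decides how much work is done (sketch; ARRAY_SIZE as in the question).
const int elementsToProcess = ARRAY_SIZE / 2;   // only the first half
const int threadsPerBlock = 256;
const int blocks = 32;   // any reasonable grid size; the loop covers the rest
VecAdd<<<blocks, threadsPerBlock>>>(d_dataA, d_dataB, d_resultC, elementsToProcess);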
As some guys mentioned above, this may be caused by data remaining from your previous run. Not freeing the memory you allocated may be the reason for this odd situation.
I think you should free the allocated arrays on the host using free and also free the memory on the GPU using cudaFree.
Also, I strongly recommend allocating the host memory using cudaMallocHost instead of malloc, and freeing it at the end of the program with cudaFreeHost. This will give you faster copies. See here: cudaMallocHost
Anyway, don't forget to free heap memory in a C/C++ program, whether with CUDA or not.
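For completeness, here is a sketch of the pinned-allocation pattern being suggested (names follow the question's code; this is an illustration, not the poster's exact code):

// Pinned (page-locked) host allocation plus matching cleanup.
float *h_dataA = nullptr;
cudaMallocHost((void **)&h_dataA, ARRAY_BYTES);   // instead of malloc()
// ... fill h_dataA, cudaMemcpy to/from the device, run kernels ...
cudaFreeHost(h_dataA);   // frees the pinned host buffer
cudaFree(d_dataA);       // frees the device buffer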
I compared the performance of an OpenCL code, running on the CPU, which simply copies data from one 2D array into another, with a pure C++ code which does the same thing. I used a single workgroup in the OpenCL code to make a fair comparison. I used Intel's OpenCL drivers and the Intel compiler. The OpenCL code is about 5 times slower than the plain C++ code. The compiler gives the following message for the copy loop:
loop was transformed to memset or memcpy.
Any suggestions on how to get the OpenCL code up to speed with the C++ code?
Thanks
OpenCL host code:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <cmath>
#include <ctime>
#include <CL/cl.hpp>
int main(int argc, char **argv)
{
// Create the two input vectors
const int N = 8192;
double *in = new double[N*N];
double *out = new double[N*N];
for(int i = 0; i < N; i++)
for (int j=0; j < N; j++) {
in[i*N + j] = i + j;
out[i*N + j] = 0.;
}
double time;
std::clock_t start;
int niter = 100;
cl_int cl_err;
std::vector<cl::Platform> platforms;
cl_err = cl::Platform::get(&platforms);
std::vector<cl::Device> devices;
cl_err = platforms.at(1).getDevices(CL_DEVICE_TYPE_CPU,
&devices);
cl_context_properties context_properties[3] = {CL_CONTEXT_PLATFORM,
(cl_context_properties)(platforms.at(1)()),
0};
cl::Context context = cl::Context(devices,
context_properties,
NULL, NULL, &cl_err);
cl::Buffer buffer_in = cl::Buffer(context,
CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
N*N*sizeof(double),
in, &cl_err);
cl::Buffer buffer_out = cl::Buffer(context,
CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
N*N*sizeof(double),
out, &cl_err);
cl::CommandQueue queue = cl::CommandQueue(context, devices.at(0), 0, &cl_err);
std::ifstream sourceFile("vector_copy.cl");
std::string sourceCode((std::istreambuf_iterator<char>(sourceFile)),
std::istreambuf_iterator<char>());
cl::Program::Sources source(1, std::make_pair(sourceCode.c_str(),
sourceCode.length()+1));
cl::Program program(context, source, &cl_err);
cl_err = program.build(devices, NULL, NULL, NULL);
cl::Kernel kernel(program, "vector_copy", &cl_err);
cl_err = kernel.setArg(0, buffer_in);
cl_err = kernel.setArg(1, buffer_out);
cl_err = kernel.setArg(2, N);
cl::NDRange global(N);
cl::NDRange local(N);
start = std::clock();
for (int n=0; n < niter; n++) {
cl_err = queue.enqueueNDRangeKernel(kernel,
cl::NullRange,
global,
local,
NULL, NULL);
cl_err = queue.finish();
}
time = (std::clock() - start)/(double)CLOCKS_PER_SEC;
std::cout << "Time/iteration OpenCL (s) = " << time/(double)niter << std::endl;
return(0);
}
OpenCL kernel code:
__kernel void vector_copy(__global const double* restrict in,
__global double* restrict out,
const int N)
{
int i = get_global_id(0);
int j;
for (j=0; j<N; j++) {
out[j + N*i] = in[j + N*i];
}
}
C++ code:
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <cmath>
#include <ctime>
const int N = 8192;
int main(int argc, char **argv)
{
double *in = new double[N*N];
double *out = new double[N*N];
// Create the two input vectors
for(int i = 0; i < N; i++)
for (int j=0; j < N; j++) {
in[j + N*i] = i + j;
out[j + N*i] = 0.;
}
std::clock_t start;
int niter = 100;
start = std::clock();
for (int n=0; n < niter; n++) {
for (int i=0; i<N; i++)
for (int j=0; j<N; j++) {
out[j + N*i] = in[j + N*i];
}
}
double time = (std::clock() - start)/(double)CLOCKS_PER_SEC;
std::cout << "Time/iteration C = " << time/(double)niter << std::endl;
return(0);
}
The Intel OpenCL compiler is able to vectorize across work-items. Basically a single function runs, as an example, 8 work-items at the same time in different SSE registers.
Your particular kernel does not do that. But it doesn't really matter. I tested your program using Visual Studio 2010 and the latest Intel OpenCL for Applications. I was forced to reduce N from 8192 to 4096 because the integrated GPU I have limits the maximum OpenCL buffer size to 128 MB, even if just the CPU is used.
My results: your OpenCL kernel gave me around 6956 MB/s of bandwidth. A trivially changed kernel (called with N*N as the global size and NULL as the local size, because if we don't care about local memory at all then for CPUs we should leave it undefined):
__kernel void vector_copy2(__global const double* restrict in,
__global double* restrict out)
{
int i = get_global_id(0);
out[i] = in[i];
}
It gave about the same result (7006 MB/s). This kernel was actually vectorized across work-items, as can be verified using the Intel OpenCL kernel compiler. It produces one kernel for some multiple of work-items (like 4) and one kernel for a single work-item. Then it just runs the vectorized kernel until it has to run the single-work-item kernel for the last few work-items.
The C++ code gave 6494 MB/s, so it's quite in line. I don't think it would even be possible for ICC to make it 5x faster.
I noticed in your code you had platforms.at(1); what was platform 0 on your computer?
Remember that if you don't care about local memory at all (you don't call get_local_id in your kernels), you should treat the local size for enqueueNDRangeKernel as a simple magic parameter. Either leave it as NULL or try to find a value that produces the fastest results.
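With the C++ wrapper that simply means passing cl::NullRange as the local size; a minimal sketch reusing the question's variables:

// Let the OpenCL runtime choose the work-group size itself
// (sketch; queue, kernel and N as in the question's host code).
cl::NDRange global(N * N);   // one work-item per element for vector_copy2
cl_int err = queue.enqueueNDRangeKernel(kernel, cl::NullRange, global,
                                        cl::NullRange,   // local size left to the driver
                                        NULL, NULL);
err = queue.finish();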
The OpenCL code, even if optimized, will still perform the copy one work-item at a time, because the OpenCL compiler is only allowed to optimize on a per-work-item basis, while the C++ loop will probably be optimized by the compiler into a memcpy() call (as the compiler is telling you).
If you disable the compiler optimizations, the OpenCL version will compare much more favourably.
BTW, is there a reason for this copy? You have memcpy() in C++ and clEnqueueCopyBuffer() in OpenCL for this purpose. I think the latter is what you should use.
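A device-side copy with the C++ wrapper would look roughly like this (a sketch reusing the question's buffers, not part of the original answer):

// Copy buffer_in into buffer_out entirely on the device, no kernel involved
// (sketch; buffers and N as in the question's host code).
cl_int err = queue.enqueueCopyBuffer(buffer_in, buffer_out,
                                     0, 0,                            // src/dst offsets
                                     (size_t)N * N * sizeof(double));
err = queue.finish();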
I am currently experimenting with the performance of OpenCL code on the GPU versus C++ on the CPU. I have written programs that compute the sum z = x + y, where z, x, and y are two-dimensional arrays (matrices), for the GPU and the CPU. After testing these programs, I have found that the CPU is much more efficient at computing this sum than the GPU, due to the slow transfer of data over the PCI bus between the GPU and the CPU. Now I want to determine how many more sums will be required to make using the GPU more efficient than the CPU. I plan to do this by increasing the sum z = x + y to z = x + y + y + y + y + ... and so on.
Will it be possible to make using the GPU more efficient than the CPU just by increasing the number of sums for this specific problem?
Just as an FYI: I am using an NVIDIA GeForce GT 640 graphics card and an Intel Core i5 CPU.
Any help will be greatly appreciated.
EDIT:
Below I have attached my code on the CPU:
int main(int argc, const char * argv[])
{
//This value determines the size of the nxn (square array)
int n = 1000;
//Allocating the memory for the nxn arrays of floats.
float **x = (float**)malloc(sizeof(float*)*n);
float **y = (float**)malloc(sizeof(float*)*n);
float **z = (float**)malloc(sizeof(float*)*n);
//Initializing the arrays.
for(int i = 0; i<n; i++){
x[i] = (float*)malloc(sizeof(float)*n);
y[i] = (float*)malloc(sizeof(float)*n);
z[i] = (float*)malloc(sizeof(float)*n);
for(int j = 0; j<n; j++){
x[i][j] = i+j;
y[i][j] = i+j;
}
}
for(int i = 0; i<n; i++){
for(int j = 0; j<n; j++){
z[i][j] = x[i][j] + y[i][j];
for(int k = 0; k < 100; k++){
z[i][j] += y[i][j];
}
}
}
return 0;
}
And here is the C++ code using OpenCL (used to copy the data and execute the kernel on the GPU):
int n = 1000;
for(int i = 0; i<n; i++)
{
//Writing the data from the host to the device
err = clEnqueueWriteBuffer(queue, d_xx, CL_TRUE, 0, sizeof(float)*n, h_xx[i], 0, NULL, NULL);
if(err != CL_SUCCESS){
std::cout << "Error: Could not write to buffer d_xx" << std::endl;
exit(1);
}
err = clEnqueueWriteBuffer(queue, d_yy, CL_TRUE, 0, sizeof(float)*n, h_yy[i], 0, NULL, NULL);
if(err != CL_SUCCESS){
std::cout << "Error: Could not write to buffer d_yy" << std::endl;
exit(1);
}
//Setting the Kernel Arguments
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_xx);
if(err != CL_SUCCESS){
std::cout << "Error: Could not set kernel argument h_xx." << std::endl;
exit(1);
}
err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_yy);
if(err != CL_SUCCESS){
std::cout << "Error: Could not set kernel argument h_yy." << std::endl;
exit(1);
}
err = clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_zz);
if(err != CL_SUCCESS){
std::cout << "Error: Could not set kernel argument h_zz." << std::endl;
}
work_units_per_kernel = n;
//Executing the Kernel
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_units_per_kernel, NULL, 0, NULL, NULL);
if(err != CL_SUCCESS){
std::cout << "Error: Could not execute kernel." << std::endl;
exit(1);
}
//Reading the Data from the Kernel
err = clEnqueueReadBuffer(queue, d_zz, CL_TRUE, 0, n*(sizeof(float)), h_zz[i], 0, NULL, NULL);
if(err != CL_SUCCESS){
std::cout << "Error: Could not read data from kernel." << std::endl;
exit(1);
}
}
And lastly the kernel code executed on the GPU:
__kernel void arraysum(__global const float *d_aa, __global const float *d_bb, __global float *d_cc)
{
int i = get_global_id(0);
d_cc[i] = d_aa[i] + d_bb[i];
for(int j = 0; j < 100; j++){
d_cc[i] += d_bb[i];
}
}
For n = 1000*1000 you are getting to the point where copying, operating, and copying back is worth it. As DarkZero pointed out, Global Memory is NOT optimal, so if you can cache your Global Memory to Local Memory or Thread Memory and use Local Work Groups, this will help tremendously on both the CPU and GPU.
Let's start with the Kernel. d_cc is referenced 100 times from Global Memory. A simple change in this case is to cache the global memory into thread memory, and then at the end copy the local back to global.
__kernel void arraysum(__global const float *d_aa, __global const float *d_bb, __global float *d_cc)
{
int i = get_global_id(0);
float t_d_cc = d_aa[i] + d_bb[i]; //make a thread only version of d_cc
for(int j = 0; j < 100; j++){
t_d_cc += d_bb[i];
}
d_cc[i] = t_d_cc; //copy the thread only back to global
}
Another change, dependent on hardware, is to cache d_aa and d_bb into local memory. This lets OpenCL take advantage of batch copies from Global Memory. This can be a bit more challenging because each OpenCL device has different sizes and multiples of Local Workgroup sizes that can be used.
For example, my i5 has a Maximum Workgroup size of 1024 and a Workgroup multiple of 1, so my Local Workgroups can be anything from 1 to 1024. My ATI-7970 has values of 256 and 64, respectively, so my Local Workgroups need to be 64, 128, etc. This is much more restrictive.
__kernel void arraysum(__global const float *d_aa,
                       __local float *l_d_aa,
                       __global const float *d_bb,
                       __local float *l_d_bb,
                       __global float *d_cc,
                       __local float *l_d_cc)
{
    //In this example, get_global_id(1) indexes the rows and get_global_id(0) the columns,
    //so when the kernel is called, the local work group size needs to be the
    //number of columns
    int i = get_global_id(1)*get_global_size(0) + get_global_id(0); //global index of this element
    int lid = get_local_id(0);                                      //index within the work group
    l_d_aa[lid] = d_aa[i];
    l_d_bb[lid] = d_bb[i];
    barrier(CLK_LOCAL_MEM_FENCE); //wait until the whole work group has filled local memory
    l_d_cc[lid] = l_d_aa[lid] + l_d_bb[lid];
    for(int j = 0; j < get_global_size(0); j++){
        l_d_cc[lid] += l_d_bb[j];
    }
    d_cc[i] = l_d_cc[lid]; //copy the local result back to global
}
I apologize if I got the algorithm wrong, but hopefully it conveys how to cache global memory to local memory. Again, on the i5, the local workgroup size can be 1 to 1024, but the ATI7970 is restricted to column sizes of 64, 128, etc.
It is conceptually much more difficult, but the performance for OpenCL is much, much better when using this approach.
Community, please feel free clean up the kernel.
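As an aside, the work-group limits mentioned above don't have to be hard-coded; they can be queried at run time. A minimal sketch, assuming a cl::Device named device and a built cl::Kernel named kernel:

// Query the device and kernel work-group limits instead of hard-coding them
// (sketch; 'device' and 'kernel' are assumed to exist already).
size_t maxWorkGroup = device.getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>();
size_t preferredMultiple =
    kernel.getWorkGroupInfo<CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE>(device);
std::cout << "max work-group size: " << maxWorkGroup
          << ", preferred multiple: " << preferredMultiple << std::endl;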
Many things are slowing you down:
1- Abuse of global memory. Each global memory access is something like 400 times slower than local/private accesses, and you ONLY use global memory (around 200 reads/writes per work-item). Global memory should only be read at the beginning and written at the end, never used for intermediate values.
2- Your N length is very short. The CPU will finish in just 1000 instructions, while all the latencies in the GPU are much longer than this. There are also overheads in the copy operations; a 100 MB copy is much more efficient than a 1-byte copy.
3- Probably the CPU code is being optimized by the compiler into multiplications, while the GPU code can't be, since it is accessing volatile variables such as globals.
4- Memory reads/writes to/from the device are very expensive; if you include this in the calculation, the CPU will easily win. OpenCL buffer and kernel creation are also very expensive. Note that you are also using blocking write calls, which are much slower than non-blocking calls.
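On the last point, making the writes non-blocking just means passing CL_FALSE and letting the blocking read at the end of the iteration act as the synchronization point; a sketch based on the question's host code:

// Non-blocking writes: enqueue both transfers without waiting, then let the
// blocking read at the end of the iteration synchronize
// (sketch; variable names follow the question's host code).
err = clEnqueueWriteBuffer(queue, d_xx, CL_FALSE, 0, sizeof(float) * n, h_xx[i], 0, NULL, NULL);
err = clEnqueueWriteBuffer(queue, d_yy, CL_FALSE, 0, sizeof(float) * n, h_yy[i], 0, NULL, NULL);
// ... clSetKernelArg / clEnqueueNDRangeKernel as before ...
err = clEnqueueReadBuffer(queue, d_zz, CL_TRUE, 0, sizeof(float) * n, h_zz[i], 0, NULL, NULL);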