CL_INVALID_ARG_SIZE when setting kernel arg - C++

I'm attempting to break away from CUDA and learn OpenCL. I thought an n-body simulation might be a good place to start. I've been using the C++ wrapper, and following the tutorial provided here to get a basic idea of how things should work.
The program loads two source files, one for each kernel function. Each is compiled and built into a separate kernel. On my first attempt, both functions were in the same program; splitting them was an attempt to fix the issue by doing something different.
nbs_forces.cl:
typedef struct body_t {
...
};
__kernel void execute(__global struct body_t* bodies, const float G, const int n){
...
}
nbs_positions.cl:
typedef struct body_t {
...
};
__kernel void execute(__global struct body_t* bodies, const float dt, const int n){
...
}
The buffers are allocated as such:
// create memory buffers
bGravity = new cl::Buffer(*context, CL_MEM_READ_ONLY, sizeof(float));
bBodies = new cl::Buffer(*context, CL_MEM_READ_WRITE,
capacity * sizeof(body_t));
bTimestep = new cl::Buffer(*context, CL_MEM_READ_ONLY, sizeof(float));
bSize = new cl::Buffer(*context, CL_MEM_READ_ONLY, sizeof(int));
After the data is copied to the buffers, and it comes time to run the simulation, I set the kernel arguments as such:
fKernel->setArg(0, *bBodies);
fKernel->setArg(1, *bGravity);
fKernel->setArg(2, *bSize);
pKernel->setArg(0, *bBodies);
pKernel->setArg(1, *bTimestep);
pKernel->setArg(2, *bSize);
cl::NDRange global(capacity);
cl::NDRange local(1);
for (int step = 0; step < steps; step++) {
    queue->enqueueNDRangeKernel(*fKernel, cl::NullRange, global, local);
    queue->enqueueNDRangeKernel(*pKernel, cl::NullRange, global, local);
}
But on execution of the second line of the simulation function (fKernel->setArg(1, *bGravity)), the program terminates with CL_INVALID_ARG_SIZE. It seemed like it should be a trivial error to solve, but try as I might, I can't find anything that would cause it.
I've tried passing different data types, including the types provided by OpenCL (cl_float, etc.), but the problem remains. I'm sure I've just done something dumb, but I've been banging my head against the wall for the past few days to no avail.
In an attempt to keep this post short I've left some code out; if there's any critical code I neglected to include, everything can be found in the git repo here.

You have a mismatch between your kernel's expected arguments:
__kernel void execute(__global struct body_t* bodies, const float G, const int n)
(a buffer containing an array of struct body_t, and 2 scalar values)
and what you actually pass to it:
bGravity = new cl::Buffer(*context, CL_MEM_READ_ONLY, sizeof(float));
bBodies = new cl::Buffer(*context, CL_MEM_READ_WRITE,
capacity * sizeof(body_t));
…
bSize = new cl::Buffer(*context, CL_MEM_READ_ONLY, sizeof(int));
…
fKernel->setArg(0, *bBodies);
fKernel->setArg(1, *bGravity);
fKernel->setArg(2, *bSize);
G and n should not be passed as buffers. Instead, the following should do the trick:
const float gravity = …;
const int32_t size = …;
fKernel->setArg(1, gravity);
fKernel->setArg(2, size);
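For completeness, here's a sketch of the corrected argument setup (the literal values are placeholders of mine, not from the question). Scalar arguments are passed by value; the C++ wrapper forwards sizeof(T) to clSetKernelArg, so the sizes now match what the kernels declare, and the bGravity, bTimestep, and bSize buffers become unnecessary:
const float gravity = 6.674e-11f;  // placeholder value
const float dt = 0.001f;           // placeholder value
const int32_t size = capacity;

fKernel->setArg(0, *bBodies);      // buffer argument, unchanged
fKernel->setArg(1, gravity);
fKernel->setArg(2, size);

pKernel->setArg(0, *bBodies);
pKernel->setArg(1, dt);
pKernel->setArg(2, size);
The original error comes from a size mismatch: passing a cl::Buffer makes the wrapper call clSetKernelArg with sizeof(cl_mem), while a float kernel argument expects sizeof(cl_float), hence CL_INVALID_ARG_SIZE.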

Related

DPC++: access a non-const size buffer or access the shared memory pointer in a class when using MPI

I'm trying to develop a code based on MPI and DPC++ for large-scale simulation. The problem can be summarized as follows: I want to declare the data size and allocate the data memory inside my class constructor, and then use them in member functions of the class. I realized that I have to give the buffer a const size if I want to use an accessor, but MPI produces an array of a different size on each rank. I then found that shared memory is not usable either, because in DPC++ I cannot capture the this pointer, so an array or matrix allocated in the class cannot be used in a member function. I am confused and have no idea what to do.
The code is like this:
#include <array>
#include <sycl/sycl.hpp>  // includes added for completeness
using namespace sycl;

class abc {
public:
    queue Q{};
    std::array<int, constsize> e;   // constsize: a compile-time constant (not shown)
    std::array<double, constsize> t;
    double *ua;                     // USM shared allocation, sized at run time

    abc() {
        // local_size differs on each MPI rank, so it is only known at run time
        ua = malloc_shared<double>(local_size, this->Q);
    }
    void b();
};

void abc::b() {
    for (int i = 0; i < constsize; i++) {
        e[i] = i;
        t[i] = 2 * i;
    }
    buffer<int> ee{e};
    buffer<double> tt{t};
    auto ini2 = this->Q.submit([&](handler &h) {
        accessor eee{ee, h, read_only};
        accessor ttt{tt, h, read_only};
        h.parallel_for(range{size1, size2, size3}, [=](id<3> idx) {
            double eu = ua[idx[0]];  // ua is a member, so this line captures this
            int aa = eee[idx[1]];
            double cc = ttt[idx[2]];
        });
    });
}
e and t can be accessed because they have a const size, so I can use buffers for them. But ua has a per-rank size that depends on MPI, so I cannot use a buffer for it, and the shared-memory pointer cannot be used in a member function either.
Any help with this?
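One point in the code above is worth illustrating: using the member ua inside a [=] lambda captures this, and dereferencing a host this on the device is what fails. A common idiom (my own sketch, reusing the question's class and its local_size) is to copy the member pointer into a local variable first, so the lambda captures the raw USM pointer by value:
void abc::b()
{
    double *ua_local = this->ua;  // plain shared-memory pointer; safe to capture by value
    this->Q.submit([&](handler &h) {
        // no accessor is needed for a USM allocation
        h.parallel_for(range{local_size}, [=](id<1> idx) {
            ua_local[idx[0]] = 2.0 * ua_local[idx[0]];
        });
    });
}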

Tensorflow GPU new op memory allocation

I am trying to create a new TensorFlow GPU op following the instructions on their website.
Looking at their example, it seems they feed a C++ pointer directly into the CUDA kernel without allocating device memory and copying the contents of the host pointer to the device pointer.
From what I understand of CUDA, you always have to allocate memory on the device and then use device pointers inside the kernels.
What am I missing? I checked that input_tensor.flat<T>().data() should return a regular C++ pointer. Here is a copy of the code I am referring to:
// kernel_example.cu.cc
#ifdef GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "example.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"
using namespace tensorflow;
using GPUDevice = Eigen::GpuDevice;
// Define the CUDA kernel.
template <typename T>
__global__ void ExampleCudaKernel(const int size, const T* in, T* out) {
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
i += blockDim.x * gridDim.x) {
out[i] = 2 * ldg(in + i);
}
}
// Define the GPU implementation that launches the CUDA kernel.
template <typename T>
void ExampleFunctor<GPUDevice, T>::operator()(
const GPUDevice& d, int size, const T* in, T* out) {
// Launch the cuda kernel.
//
// See core/util/cuda_kernel_helper.h for example of computing
// block count and thread_per_block count.
int block_count = 1024;
int thread_per_block = 20;
ExampleCudaKernel<T>
<<<block_count, thread_per_block, 0, d.stream()>>>(size, in, out);
}
// Explicitly instantiate functors for the types of OpKernels registered.
template struct ExampleFunctor<GPUDevice, float>;
template struct ExampleFunctor<GPUDevice, int32>;
#endif // GOOGLE_CUDA
When you look at these lines of code on https://www.tensorflow.org/extend/adding_an_op, you will see that the allocation is done in kernel_example.cc:
void Compute(OpKernelContext* context) override {
// Grab the input tensor
const Tensor& input_tensor = context->input(0);
// Create an output tensor
Tensor* output_tensor = NULL;
OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
&output_tensor));
// Do the computation.
OP_REQUIRES(context, input_tensor.NumElements() <= tensorflow::kint32max,
errors::InvalidArgument("Too many elements in tensor"));
ExampleFunctor<Device, T>()(
context->eigen_device<Device>(),
static_cast<int>(input_tensor.NumElements()),
input_tensor.flat<T>().data(),
output_tensor->flat<T>().data());
}
In context->allocate_output(....) they hand over a reference to the output tensor, which is then allocated. The context knows whether it is running on GPU or CPU and allocates the tensor accordingly, either on the host or on the device. The pointer handed over to CUDA then simply points to the actual data within the Tensor class.
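To see the same pattern outside TensorFlow, here is a minimal standalone CUDA sketch (mine, not from the docs): the kernel receives raw pointers that already refer to device memory, which is exactly what the functor above gets once allocate_output has run on a GPU device.
#include <cuda_runtime.h>

__global__ void Double(int size, const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) out[i] = 2.0f * in[i];
}

int main() {
    const int n = 1024;
    float *d_in, *d_out;
    // explicit allocation here; in the op, the context has already done this
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    Double<<<(n + 255) / 256, 256>>>(n, d_in, d_out);  // raw device pointers
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}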

Copy huge structure of arrays to GPU

I need to transform an existing code for SPH (Smoothed Particle Hydrodynamics) into a code that can run on a GPU.
Unfortunately, it has a lot of data structures that I need to copy from the CPU to the GPU. I already looked it up on the web, and I thought I had done the right thing with my copying code, but unfortunately I get an error (something about an unhandled exception).
When I opened the debugger, I saw that no information is passed to the variables that should be copied to the GPU. It just says "The memory could not be read".
So here is an example of one data structure that needs to be copied to the GPU:
__device__ struct d_particle_data
{
float Pos[3]; /*!< particle position at its current time */
float PosMap[3]; /*!< initial boundary particle postions */
float Mass; /*!< particle mass */
float Vel[3]; /*!< particle velocity at its current time */
float GravAccel[3]; /*!< particle acceleration due to gravity */
}*d_P;
and I pass it on the GPU with the following:
cudaMalloc((void**)&d_P, N*sizeof(sph_particle_data));
cudaMemcpy(d_P, P, N*sizeof(d_sph_particle_data), cudaMemcpyHostToDevice);
The data structure P looks the same as the data structure d_P. Can anybody help me?
EDIT
So, here's a pretty small part of that code:
First, the headers I have to use in the code:
Allvars.h: Variables that I need on the host
struct particle_data
{
float a;
float b;
}
*P;
proto.h: Header with all the functions
extern void main_GPU(int N, int Ntask);
Allvars_gpu.h: all the variables that have to be on the GPU
__device__ struct d_particle_data
{
float a;
float b;
}
*d_P;
So, now I call the .cu file from the .cpp file:
hydra.cpp:
#include <stdio.h>
#include <cuda_runtime.h>
extern "C" {
#include "proto.h"
}
int main(void) {
int N_gas = 100; // Number of particles
int NTask = 1; // Number of CPUs (Code has MPI-stuff included)
main_GPU(N_gas,NTask);
return 0;
}
Now, the action takes place in the .cu file:
hydro_gpu.cu:
#include <cuda_runtime.h>
#include <stdio.h>
extern "C" {
#include "Allvars_gpu.h"
#include "allvars.h"
#include "proto.h"
}
__device__ void hydro_evaluate(int target, int mode, struct d_particle_data *P) {
int c = 5;
float a,b;
a = P[target].a;
b = P[target].b;
P[target].a = a+c;
P[target].b = b+c;
}
__global__ void hydro_particle(struct d_particle_data *P) {
int i = threadIdx.x + blockIdx.x*blockDim.x;
hydro_evaluate(i,0,P);
}
void main_GPU(int N, int Ntask) {
int Blocks;
cudaMalloc((void**)&d_P, N*sizeof(d_particle_data));
cudaMemcpy(d_P, P, N*sizeof(d_particle_data), cudaMemcpyHostToDevice);
Blocks = (N+N-1)/N;
hydro_particle<<<Blocks,N>>>(d_P);
cudaMemcpy(P, d_P, N*sizeof(d_particle_data), cudaMemcpyDeviceToHost);
cudaFree(d_P);
}
The really short answer is probably: do not declare *d_P as a static __device__ symbol. Such symbols cannot be passed as device pointer arguments to cudaMalloc, cudaMemcpy, or kernel launches, and your use of __device__ is both unnecessary and incorrect in this example.
If you make that change, your code might start working. Note that I lost interest in trying to actually compile your MCVE code some time ago, and there might well be other problems, but I'm too bored with this question to look for them. This answer has mostly been added to get this question off the unanswered queue for the CUDA tag.
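For what it's worth, a minimal sketch of that change (same names as the question, untested): make d_P an ordinary host-side pointer variable and let it travel to the kernel as a normal argument.
// plain pointer variable instead of a __device__ symbol
struct d_particle_data *d_P = NULL;

void main_GPU(int N, int Ntask) {
    cudaMalloc((void**)&d_P, N*sizeof(struct d_particle_data));
    cudaMemcpy(d_P, P, N*sizeof(struct d_particle_data), cudaMemcpyHostToDevice);
    hydro_particle<<<1,N>>>(d_P);  // the pointer is passed as a kernel argument
    cudaMemcpy(P, d_P, N*sizeof(struct d_particle_data), cudaMemcpyDeviceToHost);
    cudaFree(d_P);
}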

C++, ECS and Saving / Loading

I have a program that employs an entity-component-system framework. Essentially this means that I have a collection of entities that have various components attached to them. Entities are actually just integer ID numbers, and components are attached to them by mapping the component to the specified ID number of the entity.
Now, I need to store collections of entities and the associated components to a file that can be modified later on, so basically I need a saving and loading functionality. However, being somewhat a newcomer to C++, I have hard time figuring out how to exactly do this.
Coming from Java and C#, my first choice would be to serialize the objects into, say, JSON, and then deserialize them when the JSON is loaded. However, C++ does not have any reflection features. So, the question is: how do I save and load C++ objects? I don't mean the actual file operations, I mean the way the objects and structs should be handled in order to preserve them between program launches.
One way of doing this is to create persistent objects in C++ and store your data that way.
Check out the following links:
C++ object persistence library similar to eternity
http://sourceforge.net/projects/litesql/
http://en.wikipedia.org/wiki/ODB_(C%2B%2B)
http://drdobbs.com/cpp/184408893
http://tools.devshed.com/c/a/Web-Development/C-Programming-Persistence/
C++ doesn't support persistence directly (there are proposals for adding persistence and reflection to C++ in the future). Persistence support is not as trivial as it may seem at first: the size and memory layout of the same object may vary from one platform to another, and different byte ordering, or endianness, complicates matters even further. To make an object persistent, we have to preserve its state in non-volatile storage, i.e., write the object out so that it retains its state outside the scope of the program in which it was created.
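As a concrete illustration of the byte-ordering point (my own sketch, separate from the links above), a portable serializer writes each integer in one fixed byte order instead of dumping the platform's in-memory layout:
#include <cstdint>

// store a 32-bit value little-endian, regardless of host endianness
void put_u32_le(unsigned char *buf, std::uint32_t v)
{
    buf[0] = static_cast<unsigned char>(v);
    buf[1] = static_cast<unsigned char>(v >> 8);
    buf[2] = static_cast<unsigned char>(v >> 16);
    buf[3] = static_cast<unsigned char>(v >> 24);
}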
Another way is to store the objects into a buffer, then push the buffer to a file.
The advantages are that the disk doesn't waste time spinning up and the write can be performed contiguously.
You can increase performance further by using threads: dump the objects to a buffer, and once that's done, trigger a thread to handle the output.
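As a rough sketch of that threading idea (a hypothetical helper of mine, separate from the example below):
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Move the filled buffer into a worker thread; the caller joins the
// returned thread when it next needs the file to be complete.
std::thread write_async(std::vector<unsigned char> buffer, std::string path)
{
    return std::thread([buf = std::move(buffer), path = std::move(path)]() {
        std::ofstream out(path, std::ios::binary);
        out.write(reinterpret_cast<const char *>(buf.data()), buf.size());
    });
}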
Example:
The following code has not been compiled and is for illustrative purposes only.
#include <algorithm>
#include <cstring>
#include <fstream>
#include <string>
using std::ofstream;
#define MAX_DATA_LEN 1024 // Assuming a max data size of 1024
class stream_interface
{
public:
    virtual ~stream_interface() {}
    virtual void load_from_buffer(const unsigned char *& buf_ptr) = 0;
    virtual size_t size_on_stream(void) const = 0;
    virtual void store_to_buffer(unsigned char *& buf_ptr) const = 0;
};
struct Component
    : public stream_interface
{
    Component() : data_length(MAX_DATA_LEN) {}
    unsigned int entity;
    std::string data;
    const unsigned int data_length;
    void load_from_buffer(const unsigned char *& buf_ptr)
    {
        entity = *((unsigned int *) buf_ptr);
        buf_ptr += sizeof(unsigned int);
        data = std::string((char *) buf_ptr);
        buf_ptr += data_length;
        return;
    }
    size_t size_on_stream(void) const
    {
        return sizeof(unsigned int) + data_length;
    }
    void store_to_buffer(unsigned char *& buf_ptr) const
    {
        *((unsigned int *) buf_ptr) = entity;
        buf_ptr += sizeof(unsigned int);
        // zero the field first so the stored string is NUL-terminated
        std::fill(buf_ptr, buf_ptr + data_length, 0);
        strncpy((char *) buf_ptr, data.c_str(), data_length);
        buf_ptr += data_length;
        return;
    }
};
int main(void)
{
Component c1;
c1.data = "Some Data";
c1.entity = 5;
ofstream data_file("ComponentList.bin", std::ios::binary);
// Determine size of buffer
size_t buffer_size = c1.size_on_stream();
// Allocate the buffer
unsigned char * buffer = new unsigned char [buffer_size];
unsigned char * buf_ptr = buffer;
// Write / store the object into the buffer.
c1.store_to_buffer(buf_ptr);
// Write the buffer to the file / stream.
data_file.write((char *) buffer, buffer_size);
data_file.close();
delete [] buffer;
return 0;
}
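A matching load path (again uncompiled and illustrative, mirroring the store above) would read the buffer back and let the object reconstruct itself:
#include <fstream>
int main(void)
{
    Component c2;
    std::ifstream data_file("ComponentList.bin", std::ios::binary);
    size_t buffer_size = c2.size_on_stream();
    unsigned char * buffer = new unsigned char [buffer_size];
    data_file.read((char *) buffer, buffer_size);
    const unsigned char * buf_ptr = buffer;
    c2.load_from_buffer(buf_ptr);  // restores entity and data
    delete [] buffer;
    return 0;
}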

How to access a class from one cuda kernel in the next kernel

I have a dev variable which I used to allocate space on the device using a class header.
Neu *dev_NN;
cudaStatus = cudaMalloc((void**)&dev_NN, sizeof(Neu));
Then I call a kernel which initialises the class on the GPU.
KGNN<<<1, threadsPerBlock>>>(dev_LaySze, dev_NN);
in the kernel
__global__ void KGNN(int * dev_LaySze, Neu * NN)
{
...
NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
}
After the return of this kernel I want to use another kernel to input data to class methods and retrieve output data (the allocators and copies are already done and work), such as
__global__ void KGFF(double *dev_inp, double *dev_outp, int *DataSize)
{
int i = threadIdx.x;
...
NN.Analyse(dev_inp, dev_outp, DataSize );
}
The second kernel knows nothing about the class that was created. As you would expect NN is unrecognised. How do I access the first NN without re-creating the class and re-initialising it? The second kernel has to be called several times, remembering the changes it made to the class variables earlier. I don't want to use the class with the CPU, only the GPU, and I don't want to pass it back and forth each time.
I don't think this has anything to do with CUDA, actually. I believe a similar problem would be observed if you tried this in ordinary C++ (assuming the pointer to NN is not a global variable).
The key aspect of the solution, as pointed out by Park Young-Bae, is simply to pass the pointer to the allocated space for NN to both kernels. There were a few other changes that I think needed to be made to what you have shown, according to my understanding of what you are trying to do (since you haven't posted complete code). Here's a fully worked example:
$ cat t635.cu
#include <stdio.h>
class MC {
int md;
public:
__host__ __device__ int get_md() { return md;}
__host__ __device__ MC(int val) { md = val; }
};
__global__ void kernel1(MC *d){
*d = MC(3);
}
__global__ void kernel2(MC *d){
printf("val = %d\n", d->get_md());
}
int main(){
MC *d_obj;
cudaMalloc(&d_obj, sizeof(MC));
kernel1<<<1,1>>>(d_obj);
kernel2<<<1,1>>>(d_obj);
cudaDeviceSynchronize();
return 0;
}
$ nvcc -arch=sm_20 -o t635 t635.cu
$ ./t635
val = 3
$
The other changes I suggest:
in your first kernel, you're passing a pointer (NN) (for which presumably you have made a device allocation), and then you are creating an object and copying that object to the allocated space. In that case I think you need:
*NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
in your second kernel, if NN is a pointer, we must use:
NN->Analyse(dev_inp, dev_outp, DataSize );
I have made those two changes to my posted example. Again, I think this is all just C++ mechanics, not anything specific to CUDA.