Copy huge structure of arrays to GPU

Copy huge structure of arrays to GPU - c++

I need to transform an existing Code about SPH (=Smoothed Particle Hydrodynamics) into a code that can be run on a GPU.
Unfortunately, it has a lot of data structure that I need to copy from the CPU to the GPU. I already looked up in the web and I thought, that I did the right thing for my copying-code, but unfortunately, I get an error (something with unhandled exception).
When I opened the Debugger, I saw that there is no information passed to my variables that should be copied to the GPU. It's just saying "The memory could not be read".
So here is an example of one data structure that needs to be copied to the GPU:
__device__ struct d_particle_data
{
float Pos[3]; /*!< particle position at its current time */
float PosMap[3]; /*!< initial boundary particle postions */
float Mass; /*!< particle mass */
float Vel[3]; /*!< particle velocity at its current time */
float GravAccel[3]; /*!< particle acceleration due to gravity */
}*d_P;
and I pass it on the GPU with the following:
cudaMalloc((void**)&d_P, N*sizeof(sph_particle_data));
cudaMemcpy(d_P, P, N*sizeof(d_sph_particle_data), cudaMemcpyHostToDevice);
The data structure P looks the same as the data structure d_P. Does anybody can help me?
EDIT
So, here's a pretty small part of that code:
First, the headers I have to use in the code:
Allvars.h: Variables that I need on the host
struct particle_data
{
float a;
float b;
}
*P;
proto.h: Header with all the functions
extern void main_GPU(int N, int Ntask);
Allvars_gpu.h: all the variables that have to be on the GPU
__device__ struct d_particle_data
{
float a;
float b;
}
*d_P;
So, now I call from the .cpp-File the -.cu-File:
hydra.cpp:
#include <stdio.h>
#include <cuda_runtime.h>
extern "C" {
#include "proto.h"
}
int main(void) {
int N_gas = 100; // Number of particles
int NTask = 1; // Number of CPUs (Code has MPI-stuff included)
main_GPU(N_gas,NTask);
return 0;
}
Now, the action takes place in the .cu-File:
hydro_gpu.cu:
#include <cuda_runtime.h>
#include <stdio.h>
extern "C" {
#include "Allvars_gpu.h"
#include "allvars.h"
#include "proto.h"
}
__device__ void hydro_evaluate(int target, int mode, struct d_particle_data *P) {
int c = 5;
float a,b;
a = P[target].a;
b = P[target].b;
P[target].a = a+c;
P[target].b = b+c;
}
__global__ void hydro_particle(struct d_particle_data *P) {
int i = threadIdx.x + blockIdx.x*blockDim.x;
hydro_evaluate(i,0,P);
}
void main_GPU(int N, int Ntask) {
int Blocks;
cudaMalloc((void**)&d_P, N*sizeof(d_particle_data));
cudaMemcpy(d_P, P, N*sizeof(d_particle_data), cudaMemcpyHostToDevice);
Blocks = (N+N-1)/N;
hydro_particle<<<Blocks,N>>>(d_P);
cudaMemcpy(P, d_P, N*sizeof(d_particle_data), cudaMemcpyDeviceToHost);
cudaFree(d_P);
}

The really short answer is probably not to declare *d_P as a static __device__ symbol. Those cannot be passed as device pointer arguments to cudaMalloc, cudaMemcpy, or kernel launches and your use of __device__ is both unecessary and incorrect in this example.
If you make that change, your code might start working. Note that I lost interest in trying to actually compile your MCVE code some time ago, and there might well be other problems, but I'm too bored with this question to look for them. This answer has mostly been added to get this question off the unanswered queue for the CUDA tag.

Related

Tensorflow GPU new op memory allocation

I am trying to create a new tensorflow GPU op following the instructions on their website.
Looking at their example, it seems they feed a C++ pointer directly into the CUDA kernel without allocating device memory and copying the contents of the host pointer to the device pointer.
From what I understand of CUDA you always have to allocate memory on the device and then use device pointers inside the kernels.
What am I missing? I checked that input_tensor.flat<T>().data() should return a regular C++ pointer. Here is a copy of the code I am referring to:
// kernel_example.cu.cc
#ifdef GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "example.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"
using namespace tensorflow;
using GPUDevice = Eigen::GpuDevice;
// Define the CUDA kernel.
template <typename T>
__global__ void ExampleCudaKernel(const int size, const T* in, T* out) {
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
i += blockDim.x * gridDim.x) {
out[i] = 2 * ldg(in + i);
}
}
// Define the GPU implementation that launches the CUDA kernel.
template <typename T>
void ExampleFunctor<GPUDevice, T>::operator()(
const GPUDevice& d, int size, const T* in, T* out) {
// Launch the cuda kernel.
//
// See core/util/cuda_kernel_helper.h for example of computing
// block count and thread_per_block count.
int block_count = 1024;
int thread_per_block = 20;
ExampleCudaKernel<T>
<<<block_count, thread_per_block, 0, d.stream()>>>(size, in, out);
}
// Explicitly instantiate functors for the types of OpKernels registered.
template struct ExampleFunctor<GPUDevice, float>;
template struct ExampleFunctor<GPUDevice, int32>;
#endif // GOOGLE_CUDA

When you look on https://www.tensorflow.org/extend/adding_an_op at this code lines you will see that the allocation is done in kernel_example.cc:
void Compute(OpKernelContext* context) override {
// Grab the input tensor
const Tensor& input_tensor = context->input(0);
// Create an output tensor
Tensor* output_tensor = NULL;
OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
&output_tensor));
// Do the computation.
OP_REQUIRES(context, input_tensor.NumElements() <= tensorflow::kint32max,
errors::InvalidArgument("Too many elements in tensor"));
ExampleFunctor<Device, T>()(
context->eigen_device<Device>(),
static_cast<int>(input_tensor.NumElements()),
input_tensor.flat<T>().data(),
output_tensor->flat<T>().data());
}
in context->allocate_output(....) they hand over a reference to the output Tensor, which is then allocated. The context knows if it is running on GPU or CPU and allocates the tensor respectively either on host or device. The pointer handed over to CUDA just points then to the actual data within the Tensor class.

C++ Vector read access violation Mylast returned 0x8

I really need help on this one cause I am extremely stuck and have no idea what to do.
Edit:
A lot of you guys are saying that I need to use the debugger but let me be clear I have not used C++ for an extremely long time and I've used visual studio for a grand total of 2 weeks so I do not know all the cool stuff it can do with the debugger.
I am a student at university at the beginning of my second year who is trying to work out how to do something mostly by failing.
I AM NOT a professional coder and I don't have all the knowledge that you people have when it comes to these issues and that is why I am asking this question. I am trying my best to show my issue so yes my code contains a lot of errors as I only have a very basic understanding of a lot of C++ principles so can you please keep that in mind when commenting
I'm only posting this here because I can don't know who else to ask right now.
I have a function called world that is suppose to call my render class to draw all the objects inside of its vector to the screen.
#include "C_World.h"
C_World::C_World()
{
// creates an instance of the renderer class to render any drawable objects
C_Renderer *render = new C_Renderer;
}
C_World::~C_World()
{
delete[] render;
}
// adds an object to the world vector
void C_World::addToWorld(C_renderable* a)
{
world_list.push_back(a);
}
void C_World::World_Update()
{
render->ClearScreen();
World_Render();
}
void C_World::World_Render() {
for (int i = 0; i < 1; i++)
{
//render->DrawSprite(world_list[i]->getTexture(), world_list[i]->get_X, world_list[i]->get_Y());
render->DrawSprite(1, 1, 1);
}
}
While testing I commented out the Sprites get functions in order to check if they were causing the issue.
the renderer sprites are added to the vector list in the constructor through the create sprite function
C_Renderer::C_Renderer()
{
// test sprite: Id = 1
CreateSprite("WhiteBlock.png", 250, 250, 1);
}
I thought this might of been the issue so I had it in other functions but this didn't solve anything
Here are the Draw and create Sprite functions
// Creates a sprite that is stored in the SpriteList
// Sprites in the spriteList can be used in the drawSprite function
void C_Renderer::CreateSprite(std::string texture_name,
unsigned int Texture_Width, unsigned int Texture_height, int spriteId)
{
C_Sprite *a = new C_Sprite(texture_name,Texture_Width,
Texture_height,spriteId);
SpriteList.push_back(a);
size_t b = SpriteList.size();
HAPI.DebugText(std::to_string(b));
}
// Draws a sprite to the X and Y co-ordinates
void C_Renderer::DrawSprite(int id,int x,int y)
{
Blit(screen, _screenWidth, SpriteList[id]->get_Texture(),
SpriteList[id]->getTexture_W(), SpriteList[id]->getTexture_H(), x, y);
}
I even added some test code into the create sprite function to check to see if the sprite was being added too the vector list. It returns 1 so I assume it is.
Exception thrown: read access violation.
std::_Vector_alloc<std::_Vec_base_types<C_Sprite *,
std::allocator<C_Sprite *> > >::_Mylast(...) returned 0x8.
that is the full error that I get from the compiler
I'm really really stuck if there is anymore information you need just say and ill post it straight away
Edit 2:
#pragma once
#include <HAPI_lib.h>
#include <vector>
#include <iostream>
#include "C_renderable.h"
#include "C_Renderer.h"
class C_World
{
public:
C_World();
~C_World();
C_Renderer *render = nullptr;
void World_Update();
void addToWorld(C_renderable* a);
private:
std::vector<C_renderable*> world_list;
void C_World::World_Render();
};
#pragma once
#include <HAPI_lib.h>
#include "C_renderable.h"
#include "C_Sprite.h"
#include <vector>
class C_Renderer
{
public:
C_Renderer();
~C_Renderer();
// gets a pointer to the top left of screen
BYTE *screen = HAPI.GetScreenPointer();
void Blit(BYTE *destination, unsigned int destWidth,
BYTE *source, unsigned int sourceWidth, unsigned int sourceHeight,
int posX, int posY);
void C_Renderer::BlitBackground(BYTE *destination,
unsigned int destWidth, unsigned int destHeight, BYTE *source,
unsigned int sourceWidth, unsigned int sourceHeight);
void SetPixel(unsigned int x,
unsigned int y, HAPI_TColour col,BYTE *screen, unsigned int width);
unsigned int _screenWidth = 1750;
void CreateSprite(std::string texture_name,
unsigned int Texture_Width,unsigned int Texture_height, int spriteId);
void DrawSprite(int id, int x, int y);
void ClearScreen();
private:
std::vector<C_Sprite*> SpriteList;
};

I don't say this lightly, but the code you've shown is absolutely terrible. You need to stop and go back several levels in your understanding of C++.
In all likeliness, your crash is the result of a simple "shadowing" issue in one or more of your functions:
C_World::C_World()
{
// creates an instance of the renderer class to render any drawable objects
C_Renderer *render = new C_Renderer;
}
C_World::~C_World()
{
delete[] render;
}
There are multiple things wrong here, and you don't show the definition of C_World but if this code compiles we can deduce that it has a member render, and you have fallen into a common trap.
C_Renderer *render = new C_Renderer;
Because this line starts with a type this is a definition of a new, local variable, render. Your compiler should be warning you that this shadows the class-scope variable of the same name.
What these lines of code
C_World::C_World()
{
// creates an instance of the renderer class to render any drawable objects
C_Renderer *render = new C_Renderer;
}
do is:
. assign an undefined value to `this->render`,
. create a *local* variable `render`,
. construct a dynamic `C_Renderer` presumably on the heap,
. assign that to the *local* variable `render`,
. exit the function discarding the value of `render`.
So at this point the memory is no-longer being tracked, it has been leaked, and this->render is pointing to an undefined value.
You repeat this problem in several of your functions, assigning new results to local variables and doing nothing with them. It may not be this specific instance of the issue that's causing the problem.
Your next problem is a mismatch of new/delete vs new[]/delete[]:
C_World::~C_World()
{
delete[] render;
}
this would result in undefined behavior: this->render is undefined, and delete[] on a non-new[] allocation is undefined.
Most programmers use a naming convention that distinguishes a member variable from a local variable. Two common practices are an m_ prefix or an _ suffix for members, e.g.
class C_World
{
public:
C_Foo* m_foo; // option a
C_Renderer* render_; // option b
// ...
}
Perhaps you should consider using modern C++'s concept of smart pointers:
#include <memory>
class C_World {
// ...
std::unique_ptr<C_Renderer> render_;
// ...
};
C_World::C_World()
: render_(new C_Renderer) // initializer list
{}
But it's unclear why you are using a dynamic allocation here in the first place. It seems like an instance member would be better:
class C_World {
C_Renderer render_;
};
C_World::C_World() : render_() {}

Is using a const method from one object thread safe?

I am not sure if this is a duplicate, I have looked through many posts but they do not seem to be close enough to my question.
I want to use a const method from one object to change other objects concurrently. The program basically needs me to move a particle under the influence of gravity and I want to run this in parallel for all particles. I made a physics class and in that class I have a const method to move a particle object.
Here are some example classes to understand me better.
/**
* Particle.h
*/
#ifndef __Particle__sim__
#define __Particle__sim__
class Particle {
private:
double height;
double velocity;
public:
const double getHeight() const;
const double getVelocity() const;
void setHeight(const double&);
void setVelocity(const double&);
};
#endif
/**
* Physics.h
*/
#ifndef __physics__sim__
#define __physics__sim__
#include <thread>
#include <vector>
#include "Particle.h"
class Physics {
private:
double gravity;
double timeStep;
void moveParticle(Particle&, const double) const;
public:
Physics(const double g, const double t);
void moveParticles(std::vector<Particle>&, const double) const;
};
#endif
/**
* Physics.cpp
*/
#include "Physics.h"
using namespace std;
Physics::Physics(const double g, const double t) : gravity(g), timeStep(t) {}
void Physics::moveParticle(Particle& particle, const double time) const {
// move particle under gravity
}
void Physics::moveParticles(vector<Particles>& particles, const double time) const {
vector<thread> threads;
threads.reserve(particles.size());
for (auto& p : particles) {
threads.push_back(thread(Physics::moveParticle&, this, std::ref(p), time));
}
for (auto& t : threads) {
t.join();
}
}
Here is essentially my main
/**
* main.cpp
*/
#include <vector>
#include "Physics.h"
#include "Particle.h"
using namespace std;
int main() {
vector<Particle> particles;
// insert 100,000 particles
Physics physics = Physics(-9.81, 0.01);
physics.moveParticles(particles, 5.0);
return 0;
}
So is physics.moveParticle(Particle&, const double) thread safe here?
Short & Sweet:
I want to use a method from one Physics object to make multiple threads to move all the Particles in my program and I am not sure if the const method I wrote is thread safe. I can't see why not but I can't justify it.

At first sight, this should be thread safe.
We need to see the implementation of Particle::setHeight to be absolutely sure. What if it did something like write to a global array? Which would be silly but we can't tell for sure.
However, your Particle looks very simple. What would be even more thread safe is to not mutate it at all. Make them immutable and create a new one with each calculation.
You can still change them by assigning the new Particle back to the old Particle.
However if you really want to get into threading, a great technique here is to have two world states: previous and next. These swap with each update step. Each update step reads from previous and writes into next. This lets other threads like graphics display read from previous without constantly locking on little things like Particles.
With that, a immutable Particle does not slow anything down at all. In fact, the compiler will rewrite the machine code for nextParticles[i] = updateParticle(prevParticles[i]) into direct assignments to its final position in memory. That's RVO or NRVO.

It looks thread-safe to me. In particular, if the only (non-immutable) data read or written by each of the spawned threads is the thread's own corresponding Particle object, then there is no shared data as far as the threads are concerned. (The Physics object itself is shared, of course, but since you are not modifying the Physics object during the sub-threads' lifetime, the Physics object is effectively immutable/read-only during the operation, and any read-only access to the Physics object by the spawned threads will not be a source of race conditions)

How to access a class from one cuda kernel in the next kernel

I have a dev variable which I used to allocate space on the device using a class header.
Neu *dev_NN;
cudaStatus = cudaMalloc((void**)&dev_NN, sizeof(Neu));
Then I call a kernel which initialises the class on the GPU.
KGNN<<<1, threadsPerBlock>>>(dev_LaySze, dev_NN);
in the kernel
__global__ void KGNN(int * dev_LaySze, Neu * NN)
{
...
NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
}
After the return of this kernel I want to use another kernel to input data to class methods and retrieve output data (the allocators and copies are already done and work), such as
__global__ void KGFF(double *dev_inp, double *dev_outp, int *DataSize)
{
int i = threadIdx.x;
...
NN.Analyse(dev_inp, dev_outp, DataSize );
}
The second kernel knows nothing about the class that was created. As you would expect NN is unrecognised. How do I access the first NN without re-creating the class and re-initialising it? The second kernel has to be called several times, remembering the changes it made to the class variables earlier. I don't want to use the class with the CPU, only the GPU, and I don't want to pass it back and forth each time.

I don't think this has anything to do with CUDA, actually. I believe a similar problem would be observed if you tried this in ordinary C++ (assuming the pointer to NN is not a global variable).
The key aspect of the solution as pointed out by Park Young-Bae is simply to pass the pointer to the allocated space for NN to both kernels. There were a few other changes that I think needed to be made to what you have shown, according to my understanding of what you are trying to do (since you haven't posted a complete code.) Here's a fully worked example:
$ cat t635.cu
#include <stdio.h>
class MC {
int md;
public:
__host__ __device__ int get_md() { return md;}
__host__ __device__ MC(int val) { md = val; }
};
__global__ void kernel1(MC *d){
*d = MC(3);
}
__global__ void kernel2(MC *d){
printf("val = %d\n", d->get_md());
}
int main(){
MC *d_obj;
cudaMalloc(&d_obj, sizeof(MC));
kernel1<<<1,1>>>(d_obj);
kernel2<<<1,1>>>(d_obj);
cudaDeviceSynchronize();
return 0;
}
$ nvcc -arch=sm_20 -o t635 t635.cu
$ ./t635
val = 3
$
The other changes I suggest:
in your first kernel, you're passing a pointer (NN) (which presumably you have made a device allocation for), and then you are creating an opject and copying that object to the allocated space. In that case I think you need:
*NN = Neu(dev_LaySze[0], dev_LaySze[1], dev_LaySze[2]);
in your second kernel, if NN is a pointer, we must use:
NN->Analyse(dev_inp, dev_outp, DataSize );
I have made those two changes to my posted example. Again, I think this is all just C++ mechanics, not anything specific to CUDA.

Giving an instance of a class a pointer to a struct

I am trying to get SSE functionality in my vector class (I've rewritten it three times so far. :\) and I'm doing the following:
#ifndef _POINT_FINAL_H_
#define _POINT_FINAL_H_
#include "math.h"
namespace Vector3D
{
#define SSE_VERSION 3
#if SSE_VERSION >= 2
#include <emmintrin.h> // SSE2
#if SSE_VERSION >= 3
#include <pmmintrin.h> // SSE3
#endif
#else
#include <stdlib.h>
#endif
#if SSE_VERSION >= 2
typedef union { __m128 vector; float numbers[4]; } VectorData;
//typedef union { __m128 vector; struct { float x, y, z, w; }; } VectorData;
#else
typedef struct { float x, y, z, w; } VectorData;
#endif
class Point3D
{
public:
Point3D();
Point3D(float a_X, float a_Y, float a_Z);
Point3D(VectorData* a_Data);
~Point3D();
// a lot of not-so-interesting functions
private:
VectorData* _NewData();
}; // class Point3D
}; // namespace Vector3D
#endif
It works! Hurray! But it's slower than my previous attempt. Boo.
I've determined that my bottle neck is the malloc I'm using to get a pointer to a struct.
VectorData* Point3D::_NewData()
{
#if SSE_VERSION >= 2
return ((VectorData*) _aligned_malloc(sizeof(VectorData), 16));
#else
return ((VectorData*) malloc(sizeof(VectorData)));
#endif
}
One of the main problems with using SSE in a class is that it has to be aligned in memory for it to work, which means overloading the new and delete operators, resulting in code like this:
BadVector* test1 = new BadVector(1, 2, 3);
BadVector* test2 = new BadVector(4, 5, 6);
*test1 *= test2;
You can no longer use the default constructor and you have to avoid new like the plague.
My new approach is basically to have the data external from the class so the class doesn't have to be aligned.
My question is: is there a better way to get a pointer to an (aligned on memory) instance of a struct or is my approach really dumb and there's a much cleaner way?

How about:
__declspec( align( 16 ) ) VectorData vd;
?
You can also create your own version of operator new as follows
void* operator new( size_t size, size_t alignment )
{
return __aligned_malloc( size, alignment );
}
which can then make allocationas follows
AlignedData* pData = new( 16 ) AlignedData;
to align at a 16 byte boundary.
If thats no help then i may be misunderstanding what you are asking for ...

You should probably not expect to get improved performance for single-use vectors. Parallel processing shines brightest when you can combine the parallel processing with some volume, i.e. when processing many vectors in sequence.

I fixed it. :O
It was really rather easy. All I had to do was turn
VectorData* m_Point;
into
VectorData m_Point;
and my problems are gone, with no need for malloc or aligning.
But I appreciate everyone's help! :D

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Copy huge structure of arrays to GPU - c++

Related

Tensorflow GPU new op memory allocation

C++ Vector read access violation Mylast returned 0x8

Is using a const method from one object thread safe?

How to access a class from one cuda kernel in the next kernel

Giving an instance of a class a pointer to a struct

Categories

Resources