Create grid of threads in OpenCL - concurrency

I wrote an OpenCL kernel that initialises every element of a 3D array to i*i*i + j*j*j. I'm now having trouble creating a grid of threads to do the initialisation of the elements concurrently. I know that the code I have now only uses 3 threads; how can I expand on that?
Please help. I'm new to OpenCL, so any suggestion or explanation might be handy. Thanks!
This is my code:
_kernel void initialize (
int X;
int Y;
int Z;
_global float*A) {
// Get global position in X direction
int dirX = get_global_id(0);
// Get global position in Y direction
int dirY = get_global_id(1);
// Get global position in Z direction
int dirZ = get_global_id(2);
int A[2000][100][4];
int i,j,k;
for (i=0;i<2000;i++)
{
for (j=0;j<100;j++)
{
for (k=0;k<4;k++)
{
A[dirX*X+i][dirY*Y+j][dirZ*Z+k] = i*i*i + j*j*j;
}
}
}
}

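You create the buffer to store your output 'A' in the calling (host) code. This is passed to your kernel as a pointer, which is the right idea in your function definition above. However, you don't need to declare it again inside your kernel function, so remove the line int A[2000][100][4];. Note also that the qualifiers are spelled __kernel and __global (two underscores), and that kernel parameters are separated by commas, not semicolons.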
You can simplify the code greatly. Using the 3D global ID to indicate the 3D index into the array for each work-item, you could change the loop as follows (assuming that for a given i and j, all elements along Z should have the same value):
__kernel void initialize (__global float* A) {
    // explicit cast so that the kernel compiler knows the array dimensions
    __global float (*a)[2000][100][4] = (__global float (*)[2000][100][4])A;
    // Get global position in X direction
    int i = get_global_id(0);
    // Get global position in Y direction
    int j = get_global_id(1);
    // Get global position in Z direction
    int k = get_global_id(2);
    // compute in float: i*i*i overflows a 32-bit int once i reaches 1291
    (*a)[i][j][k] = (float)i*i*i + (float)j*j*j;
}
In your calling code you would then enqueue the kernel (clEnqueueNDRangeKernel) with a three-dimensional global work-size of 2000x100x4.
Practically this is a lot of work items to schedule, so you would likely get better performance from a global (one-dimensional) work-size of 2000 and a loop inside the kernel, e.g.:
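__kernel void initialize (__global float* A) {
    // explicit cast so that the kernel compiler knows the array dimensions
    __global float (*a)[2000][100][4] = (__global float (*)[2000][100][4])A;
    // Get global position in X direction
    int i = get_global_id(0);
    // each work-item fills one [100][4] slice
    for (int j = 0; j < 100; j++) {
        for (int k = 0; k < 4; k++) {
            // compute in float: i*i*i overflows a 32-bit int once i reaches 1291
            (*a)[i][j][k] = (float)i*i*i + (float)j*j*j;
        }
    }
}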
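To check the kernel's output on the host, it helps to know which flat offset (*a)[i][j][k] refers to. The sketch below is plain C++ rather than OpenCL, and the names flat_index and reference_init are illustrative, not part of any OpenCL API. It assumes the 2000x100x4 row-major layout from the question and fills a reference buffer that the device result could be compared against.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dimensions taken from the question's array: A[2000][100][4].
const std::size_t NX = 2000, NY = 100, NZ = 4;

// Row-major offset of element [i][j][k], i.e. what (*a)[i][j][k] resolves to.
inline std::size_t flat_index(std::size_t i, std::size_t j, std::size_t k)
{
    assert(i < NX && j < NY && k < NZ);
    return (i * NY + j) * NZ + k;
}

// Host-side reference: fills a buffer with i^3 + j^3, computed in float.
std::vector<float> reference_init()
{
    std::vector<float> a(NX * NY * NZ);
    for (std::size_t i = 0; i < NX; ++i) {
        for (std::size_t j = 0; j < NY; ++j) {
            const float fi = static_cast<float>(i);
            const float fj = static_cast<float>(j);
            for (std::size_t k = 0; k < NZ; ++k)
                a[flat_index(i, j, k)] = fi * fi * fi + fj * fj * fj;
        }
    }
    return a;
}
```

After reading the buffer back from the device, each element can be compared against reference_init() at the same flat index.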

Related

C++ code for Microsoft Kinect - trying to dynamically allocate array of target positions

So I'm trying to modify the Kinect BodyBasicsD2D code so that a fixed number of "target positions" appear on the screen (as ellipses) for the user to move his hand toward. I'm having trouble creating the initial target positions.
This is my code in the header file for the allocation of the array of target positions (these are a public field of the CBodyBasics class, already built into the original BodyToBasics program):
D2D1_POINT_2F* targetPositions = NULL;
int numTargets = 3;
Then I have a function "GenerateTargetPositions" which is supposed to generate 3, in this case, target positions to be passed into the "DrawTargetPositions" function.
void CBodyBasics::GenerateTargetPositions(D2D1_POINT_2F * targetPositions, int numTargets)
{
targetPositions = new D2D1_POINT_2F[numTargets];
RECT rct;
GetClientRect(GetDlgItem(m_hWnd, IDC_VIDEOVIEW), &rct);
int width = rct.right;
int height = rct.bottom;
FLOAT x;
FLOAT y;
D2D1_POINT_2F tempPoint;
for (int i = 0; i < numTargets; i++) {
x = 1.0f*i*width / numTargets;
y = 1.0f*i*height / numTargets;
tempPoint = D2D1::Point2F(x, y);
targetPositions[i] = tempPoint;
}
}
My DrawTargetPositions function is:
void CBodyBasics::DrawTargetPositions(D2D1_POINT_2F * targetPositions, int numTargets)
{
D2D1_ELLIPSE ellipse;
for (int i = 0; i < numTargets; i++)
{
ellipse = D2D1::Ellipse(targetPositions[i], 50.f, 50.f);
m_pRenderTarget->FillEllipse(ellipse, m_pSilverBrush);
}
}
When I try to run my code, I get the error that both "targetPositions" and "targetPositions[i]" are NULL (and thus my GenerateTargetPositions function must not be working properly). I believe that targetPositions[i] is a struct (a point with x and y values), so I am wondering if this may be the reason for my errors.
I call GenerateTargetPositions and DrawTargetPositions before the main "while" loop in my code so that each function is not called on every iteration (there are many iterations through the while loop because this is an interactive Microsoft Kinect program that records one's movements).
Any suggestions and advice would be greatly appreciated. Thanks so much!
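One thing visible in the posted code is that GenerateTargetPositions receives the pointer by value, so the result of new is stored in the function's local copy (which also shadows the class member of the same name) and the caller's pointer stays NULL. The sketch below shows that effect and one possible alternative, returning a std::vector. Point is a hypothetical stand-in for D2D1_POINT_2F so that the example builds without Direct2D, and the function names are illustrative.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for D2D1_POINT_2F.
struct Point { float x; float y; };

// Mirrors the posted signature: 'out' is a copy of the caller's pointer,
// so assigning to it does not change the caller's variable.
void generateByValue(Point* out, int n)
{
    out = new Point[n];
    delete[] out; // freed here only to keep the sketch leak-free
}

// Returning the container hands the positions back to the caller.
std::vector<Point> generateTargets(int n, int width, int height)
{
    assert(n > 0);
    std::vector<Point> targets(n);
    for (int i = 0; i < n; ++i) {
        targets[i].x = 1.0f * i * width / n;
        targets[i].y = 1.0f * i * height / n;
    }
    return targets;
}
```

A pointer passed to generateByValue is still null afterwards, whereas generateTargets(3, width, height) returns three usable points. Taking the pointer by reference, or writing to the member directly, would be other ways to get the same effect.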

Thrust - sorting member arrays of class object on gpu

Currently I am working on porting a molecular dynamics simulation program, which was written in plain cpu C++, to Cuda. In short, the program initialises a list of atoms, transfers the control to an object of class CCalc which calculates atomic forces, velocities and positions for 100 (or another number of) iterations, and finally returns to draw the atoms on the screen.
My goal is to have all compute-heavy functions in CCalc run on the gpu. To prevent having to copy all calculation constants in CCalc one by one, I decided to copy the whole class to device memory, pointed to by this__d. Since the drawing function is called from the cpu, the atom list needs to be copied between cpu and gpu every 100 iterations and as such is not a member of CCalc.
In function CCalc::refreshCellList(), I want to rearrange atoms__d (the atom list residing in device memory) such that all atoms in the same cell are grouped together. In other words, atoms__d needs to be sorted with cellId as keys.
As I don't want to waste time implementing my own sorting algorithm, I tried using thrust::sort_by_key(). And here's where I got stuck. The function thrust::sort_by_key() requires device_ptr objects as arguments; however I cannot access cellId since I can only cast this__d to device_ptr, which I can't dereference on the cpu.
Is there a way to do this without having to break down the "class on gpu" structure?
Here is (an excerpt of) my code:
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "device_functions.h"
#include <vector>
#include <thrust/sort.h>
#include <thrust/device_ptr.h>
#define REFRESH_CELL_LISTS 20
struct Atom
{
float pos[3];
float vel[3];
float force[3];
// others
};
std::vector<Atom> atom;
Atom *atom__d;
int noOfAtoms = 0;
class CCalc;
__global__ void makeCells(CCalc *C, Atom *A);
class CCalc
{
private:
CCalc *this__d;
public:
const int nAtoms = noOfAtoms;
int *cellId;
const int nCellX = 4, nCellY = 3;
// many force calculation constants
CCalc()
{
cudaMalloc((void**)&cellId, nAtoms*sizeof(int));
// some other stuff
cudaMalloc((void**)&this__d, sizeof(CCalc));
cudaMemcpy(this__d, this, sizeof(CCalc), cudaMemcpyHostToDevice);
}
// destructor
void relaxStructure(int numOfIterations)
{
cudaMalloc((void**)&atom__d, nAtoms*sizeof(Atom));
cudaMemcpy(atom__d, &atom[0], nAtoms*sizeof(Atom), cudaMemcpyHostToDevice);
for(int iter = 0; iter < numOfIterations; iter++)
{
// stuff
if(!(iter % REFRESH_CELL_LISTS)) refreshCellLists();
// calculate forces; update velocities and positions
}
cudaMemcpy(&atom[0], atom__d, nAtoms*sizeof(Atom), cudaMemcpyDeviceToHost);
cudaFree(atom__d);
}
// functions for force, velocity and position calculation
void refreshCellLists()
{
makeCells<<<(nAtoms + 31) / 32, 32>>>(this__d, atom__d);
cudaDeviceSynchronize();
// sort atom__d array using cellId as keys;
// here is where I would like to use thrust::sort_by_key()
}
};
__global__ void makeCells(CCalc *C, Atom *A)
{
int index = blockDim.x*blockIdx.x + threadIdx.x;
if(index < C->nAtoms)
{
// determine cell x, y based on position
// for now let's use an arbitrary mapping to obtain x, y
int X = (index * index) % C->nCellX;
int Y = (index * index) % C->nCellY;
C->cellId[index] = X + Y * C->nCellX;
}
}
int main()
{
cudaSetDevice(0);
noOfAtoms = 1000; // normally defined by input file
atom.resize(noOfAtoms);
// initialise atom positions, velocities and forces
CCalc calcObject;
while(true) // as long as we need
{
// draw atoms on screen
calcObject.relaxStructure(100);
}
}
Thank you very much.
"In other words, atoms__d needs to be sorted with cellId as keys."
It should be possible to do that, at your indicated point in the refreshCellLists method. For simplicity, I have chosen to use the raw device pointers directly (although we could easily wrap these raw device pointers in thrust::device_ptr also) combined with the thrust::device execution policy. Here is a worked example:
$ cat t1156.cu
#include <vector>
#include <thrust/execution_policy.h>
#include <thrust/sort.h>
#include <thrust/device_ptr.h>
#define REFRESH_CELL_LISTS 20
struct Atom
{
float pos[3];
float vel[3];
float force[3];
// others
};
std::vector<Atom> atom;
Atom *atom__d;
int noOfAtoms = 0;
class CCalc;
__global__ void makeCells(CCalc *C, Atom *A);
class CCalc
{
private:
CCalc *this__d;
public:
const int nAtoms = noOfAtoms;
int *cellId;
const int nCellX = 4, nCellY = 3;
// many force calculation constants
CCalc()
{
cudaMalloc((void**)&cellId, nAtoms*sizeof(int));
// some other stuff
cudaMalloc((void**)&this__d, sizeof(CCalc));
cudaMemcpy(this__d, this, sizeof(CCalc), cudaMemcpyHostToDevice);
}
// destructor
void relaxStructure(int numOfIterations)
{
cudaMalloc((void**)&atom__d, nAtoms*sizeof(Atom));
cudaMemcpy(atom__d, &atom[0], nAtoms*sizeof(Atom), cudaMemcpyHostToDevice);
for(int iter = 0; iter < numOfIterations; iter++)
{
// stuff
if(!(iter % REFRESH_CELL_LISTS)) refreshCellLists();
// calculate forces; update velocities and positions
}
cudaMemcpy(&atom[0], atom__d, nAtoms*sizeof(Atom), cudaMemcpyDeviceToHost);
cudaFree(atom__d);
}
// functions for force, velocity and position calculation
void refreshCellLists()
{
makeCells<<<(nAtoms + 31) / 32, 32>>>(this__d, atom__d);
cudaDeviceSynchronize();
// sort atom__d array using cellId as keys;
thrust::sort_by_key(thrust::device, cellId, cellId+nAtoms, atom__d);
}
};
__global__ void makeCells(CCalc *C, Atom *A)
{
int index = blockDim.x*blockIdx.x + threadIdx.x;
if(index < C->nAtoms)
{
// determine cell x, y based on position
// for now let's use an arbitrary mapping to obtain x, y
int X = (index * index) % C->nCellX;
int Y = (index * index) % C->nCellY;
C->cellId[index] = X + Y * C->nCellX;
}
}
int main()
{
cudaSetDevice(0);
noOfAtoms = 1000; // normally defined by input file
atom.resize(noOfAtoms);
// initialise atom positions, velocities and forces
CCalc calcObject;
for (int i = 0; i < 100; i++) // as long as we need
{
// draw atoms on screen
calcObject.relaxStructure(100);
}
}
$ nvcc -std=c++11 -o t1156 t1156.cu
$ cuda-memcheck ./t1156
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
When building Thrust codes, especially on Windows, I usually make a set of recommendations as summarized here.
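To sanity-check the device result, a host-side reference for the same "sort values by key" operation can be handy. The following is a plain C++ sketch using only the standard library; it is not Thrust code, and sortByKeyHost is an illustrative name. It uses a stable sort, whereas thrust::sort_by_key is not guaranteed to be stable, so when comparing the two it is safer to compare which atoms land in each cell than the exact order within a cell.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sorts 'keys' ascending and applies the same permutation to 'values'.
template <typename Key, typename Value>
void sortByKeyHost(std::vector<Key>& keys, std::vector<Value>& values)
{
    assert(keys.size() == values.size());
    // order[i] is the original position of the element that ends up at i
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // stable_sort keeps equal keys in their original relative order
    std::stable_sort(order.begin(), order.end(),
                     [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

    std::vector<Key> sortedKeys(keys.size());
    std::vector<Value> sortedValues(values.size());
    for (std::size_t i = 0; i < order.size(); ++i) {
        sortedKeys[i] = keys[order[i]];
        sortedValues[i] = values[order[i]];
    }
    keys.swap(sortedKeys);
    values.swap(sortedValues);
}
```

In the question's setting, the keys would be a host copy of cellId and the values a host copy of the atom array.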

Tile mapping error

I am still learning SFML and C++ so please understand that I'm still at the basic level.
This is my first time using this site so IDK if I'm doing this right.
I want to make a function, set, that will allow me to pass a 2d array as an argument and place a tile down whenever there is a 1 in the array. So I can draw maps and things using a matrix. ww is the window width and wh is the window height. In main I made a for loop that would go through tiles and draw them to the window. But when I run this it gives me the error: Segmentation Fault (core dumped) "Error: 139". Is there a better way of doing this and what am I doing wrong?
Thank you.
struct field
{
int rectsizex;
int rectsizey;
RectangleShape * tiles;
field (int s)
{
rectsizex = ww / s;
rectsizey = wh / s;
tiles = new RectangleShape[rectsizex * rectsizey];
}
~field()
{
delete tiles;
}
RectangleShape * set(int ** matr)
{
Vector2f size((ww / rectsizex), (wh / rectsizey));
int posx = ww / rectsizex;
int posy = wh / rectsizey;
for(int x = 0; x<rectsizex; x++)
{
for(int y = 0; y<rectsizey; y++)
{
int i = ((x*rectsizey)+1)+y;
tiles[i].setSize(size);
if(matr[x][y] == 1)
{
tiles[i].setFillColor(Color::Black);
}
else
{
tiles[i].setFillColor(Color::White);
}
tiles[i].setPosition(x * posx, y * posy);
}
}
return tiles;
}
};
Find out what values you are getting for i:
int i = ((x*rectsizey)+1)+y;
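Because of the +1, i runs from 1 to rectsizex*rectsizey, while the valid indices are 0 to rectsizex*rectsizey - 1. The last tile is therefore written one element past the end of the array, which is undefined behaviour and a likely cause of the crash. Dropping the +1 gives i = (x*rectsizey)+y. Use a debugger, or put some print statements after you compute i, to confirm. Separately, since tiles is allocated with new[], the destructor should use delete[] tiles.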
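The off-by-one can be seen with a small grid. This is a minimal sketch, assuming a grid of 4 columns by 3 rows (12 tiles); tileIndex and postedIndex are illustrative names, with postedIndex reproducing the formula from the question.

```cpp
#include <cassert>
#include <cstddef>

// Row-major index without the "+1" offset; valid range is 0 .. cols*rows - 1.
inline std::size_t tileIndex(std::size_t x, std::size_t y, std::size_t rows)
{
    assert(y < rows);
    return x * rows + y;
}

// The index formula from the question, kept for comparison.
inline std::size_t postedIndex(std::size_t x, std::size_t y, std::size_t rows)
{
    return ((x * rows) + 1) + y;
}
```

For the last tile (x = 3, y = 2, rows = 3), tileIndex gives 11, the last valid slot, while postedIndex gives 12, which is one past the end of a 12-element array. The posted formula also never touches element 0.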

Change OpenCL function to C++

I am trying to write some code in C++, and after some searching on the internet I found OpenCL-based code that does exactly what I want to do in C++. But since this is the first time I have seen OpenCL code, I don't know how to translate the following into C++:
const __global float4 *in_buf;
int x = get_global_id(0);
int y = get_global_id(1);
float result = y * get_global_size(0);
Is 'const __global float4 *in_buf' equivalent to 'const float *in_buf' in c++? And how to change the above other functions? Could anyone help? Thanks.
In general, you should take a look at the OpenCL specification (I'm assuming it's written in OpenCL 1.x) to better understand functions, types and how a kernel works.
Specifically for your question:
get_global_id(d) returns the index of the current work-item in dimension d, and get_global_size(d) returns the total number of work-items in that dimension. Since an OpenCL work-item is roughly equivalent to a single iteration in a sequential language, the equivalent of OpenCL's:
int x = get_global_id(0);
int y = get_global_id(1);
// do something with x and y
float result = y * get_global_size(0);
Will be C's:
for (int x = 0; x < dim0; x++) {
for (int y = 0; y < dim1; y++) {
// do something with x and y
float result = y * dim0;
}
}
As for float4, it's a vector type of 4 floats, roughly equivalent to C's float[4] (except that it supports many additional operators, such as vector arithmetic). Of course in this case it's a buffer, so an appropriate type would be a pointer to such arrays, float (*)[4] - or better yet, just pack them together into a float* buffer and then load 4 at a time.
Feel free to ignore the __global modifier.
const __global float4 *in_buf is not equivalent to const float *in_buf.
OpenCL has vector types, e.g. floatN, where N is e.g. 2, 4, 8. So float4 behaves roughly like struct { float x, y, z, w; } with a lot of tricks available to express vector operations.
get_global_id(0) gives you the iterator variable, so essentially replace every get_global_id(dim) with for(int x = 0; x< max[dim]; x++)
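Putting the two answers together, the loop translation described above can be sketched in plain C++. Float4 is a hypothetical stand-in for OpenCL's float4, runOnHost is an illustrative name, and the "kernel body" (summing the four components) is invented for the example since the original kernel is not shown. dim0 and dim1 play the role of get_global_size(0) and get_global_size(1).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain C++ stand-in for OpenCL's float4 (components x, y, z, w).
struct Float4 { float x, y, z, w; };

// Runs the loop body once per (x, y) pair, as a 2D NDRange would.
std::vector<float> runOnHost(const std::vector<Float4>& in_buf,
                             std::size_t dim0, std::size_t dim1)
{
    assert(in_buf.size() == dim0 * dim1);
    std::vector<float> out(dim0 * dim1);
    for (std::size_t y = 0; y < dim1; ++y) {       // replaces get_global_id(1)
        for (std::size_t x = 0; x < dim0; ++x) {   // replaces get_global_id(0)
            const Float4& p = in_buf[y * dim0 + x];
            // hypothetical kernel body: sum the four components
            out[y * dim0 + x] = p.x + p.y + p.z + p.w;
        }
    }
    return out;
}
```

The offset y * dim0 + x corresponds to the y * get_global_size(0) term in the question's snippet plus the x index.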

How to work around a very large 2d array in C++

I need to create a 2D int array of size 800x800. But doing so creates a stack overflow (ha ha).
I'm new to C++, so should I do something like a vector of vectors? And just encapsulate the 2d array into a class?
Specifically, this array is my zbuffer in a graphics program. I need to store a z value for every pixel on the screen (hence the large size of 800x800).
Thanks!
You need about 2.5 megs, so just using the heap should be fine. You don't need a vector unless you need to resize it. See C++ FAQ Lite for an example of using a "2D" heap array.
int *array = new int[800*800];
(Don't forget to delete[] it when you're done.)
Every post so far leaves the memory management for the programmer. This can and should be avoided. ReaperUnreal is darn close to what I'd do, except I'd use a vector rather than an array and also make the dimensions template parameters and change the access functions -- and oh just IMNSHO clean things up a bit:
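#include <cstddef>
#include <vector>
template <class T, std::size_t W, std::size_t H>
class Array2D
{
public:
    // static so the dimensions are compile-time constants shared by all instances
    static const std::size_t width = W;
    static const std::size_t height = H;
    typedef T type;
    Array2D()
        : buffer(width*height)
    {
    }
    // row-major access; no bounds checking
    inline type& at(unsigned int x, unsigned int y)
    {
        return buffer[y*width + x];
    }
    inline const type& at(unsigned int x, unsigned int y) const
    {
        return buffer[y*width + x];
    }
private:
    std::vector<T> buffer;
};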
Now you can declare this 2-D array as a local variable just fine, because the elements themselves live on the heap inside the vector:
void foo()
{
Array2D<int, 800, 800> zbuffer;
// Do something with zbuffer...
}
I hope this helps!
EDIT: Removed array specification from Array2D::buffer. Thanks to Andreas for catching that!
Kevin's example is good, however:
std::vector<T> buffer[width * height];
Should be
std::vector<T> buffer;
Expanding it a bit you could of course add operator-overloads instead of the at()-functions:
const T &operator()(int x, int y) const
{
return buffer[y * width + x];
}
and
T &operator()(int x, int y)
{
return buffer[y * width + x];
}
Example:
int main()
{
Array2D<int, 800, 800> a;
a(10, 10) = 50;
std::cout << "A(10, 10)=" << a(10, 10) << std::endl;
return 0;
}
You could do a vector of vectors, but that would have some overhead. For a z-buffer the more typical method would be to create an array of size 800*800=640000.
const int width = 800;
const int height = 800;
unsigned int* z_buffer = new unsigned int[width*height];
Then access the pixels as follows:
unsigned int z = z_buffer[y*width+x];
I might create a single dimension array of 800*800. It is probably more efficient to use a single allocation like this, rather than allocating 800 separate vectors.
int *ary=new int[800*800];
Then, probably encapsulate that in a class that acted like a 2D array.
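class _2DArray
{
public:
    _2DArray() : ary(new int[800*800]) {}
    ~_2DArray() { delete[] ary; }
    int *operator[](const size_t &idx)
    {
        return &ary[idx*800];
    }
    const int *operator[](const size_t &idx) const
    {
        return &ary[idx*800];
    }
private:
    int *ary; // one 800*800 allocation, row-major; copying is not handled in this sketch
};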
The abstraction shown here has a lot of holes, e.g., what happens if you access out past the end of a "row"? The book "Effective C++" has a pretty good discussion of writing good multi-dimensional arrays in C++.
One thing you can do is change the stack size (if you really want the array on the stack). With VC, the flag to do this is [/F](http://msdn.microsoft.com/en-us/library/tdkhxaks(VS.80).aspx).
But the solution you probably want is to put the memory in the heap rather than on the stack, for that you should use a vector of vectors.
The following line declares a vector of 800 elements, each element is a vector of 800 ints and saves you from managing the memory manually.
std::vector<std::vector<int> > arr(800, std::vector<int>(800));
Note the space between the two closing angle brackets (> >), which is required in order to disambiguate it from the right-shift operator (the space will no longer be needed in C++0x).
Or you could try something like:
boost::shared_array<int> zbuffer(new int[width*height]);
You should still be able to do this too:
++zbuffer[0];
No more worries about managing the memory, no custom classes to take care of, and it's easy to throw around.
There's the C-like way of doing it:
const int xwidth = 800;
const int ywidth = 800;
int* array = new int[xwidth * ywidth];
// new throws std::bad_alloc on failure rather than returning NULL, so no NULL check is needed here
// Then do stuff with the array, such as zero initialize it
for(int x = 0; x < xwidth; ++x)
{
for(int y = 0; y < ywidth; ++y)
{
array[y * xwidth + x] = 0;
}
}
// Just use array[y * xwidth + x] when you want to access an element.
// When you're done with it, free the memory you allocated with
delete[] array;
You could encapsulate the y * xwidth + x inside a class with easy get and set methods (possibly overloading the [] operator if you want to start getting into more advanced C++). I'd recommend getting to this slowly though if you're just starting with C++, and not starting out by creating fully reusable class templates for n-dimensional arrays, which will just confuse you at this stage.
As soon as you get into graphics work you might find that the overhead of having extra class calls might slow down your code. However don't worry about this until your application isn't fast enough and you can profile it to show where the time is lost, rather than making it more difficult to use at the start with possible unnecessary complexity.
I found that the C++ FAQ Lite was great for information such as this. In particular your question is answered by:
http://www.parashift.com/c++-faq-lite/freestore-mgmt.html#faq-16.16
You can allocate the array in static storage (at file scope, or with the static qualifier at function scope) if you need only one instance.
int array[800][800];
void fn()
{
static int array[800][800];
}
This way it will not go on the stack, and you do not have to deal with dynamic memory.
Well, building on what Niall Ryan started, if performance is an issue, you can take this one step further by optimizing the math and encapsulating this into a class.
So we'll start with a bit of math. Recall that 800 can be written as a sum of powers of 2:
800 = 512 + 256 + 32 = 2^9 + 2^8 + 2^5
So we can write our addressing function as:
int index = (y << 9) + (y << 8) + (y << 5) + x; // parentheses needed: + binds tighter than <<
So if we encapsulate everything into a nice class we get:
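class ZBuffer
{
public:
    // static so that they can be used as the array bound below
    static const int width = 800;
    static const int height = 800;
    ZBuffer()
    {
        unsigned int *pBuff = zbuff;
        for(int i = 0; i < width * height; i++, pBuff++)
            *pBuff = 0;
    }
    inline unsigned int getZAt(unsigned int x, unsigned int y) const
    {
        // each shift is parenthesized because + binds tighter than <<
        return *(zbuff + ((y << 9) + (y << 8) + (y << 5) + x));
    }
    inline void setZAt(unsigned int x, unsigned int y, unsigned int z)
    {
        *(zbuff + ((y << 9) + (y << 8) + (y << 5) + x)) = z;
    }
private:
    // about 2.5 MB: keep ZBuffer objects static or on the heap, not on the stack
    unsigned int zbuff[width * height];
};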
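One thing to watch with the shift-based addressing is operator precedence: in C++, + binds more tightly than <<, so every shift needs its own parentheses or the expression computes something else entirely. The sketch below compares the shift form against the plain multiplication over the whole 800x800 range; shiftIndex, mulIndex and checkAllIndices are illustrative names. Optimizing compilers typically turn a multiplication by a constant into shifts and adds on their own, so it is worth measuring before relying on this by hand.

```cpp
#include <cassert>
#include <cstddef>

// 800 = 512 + 256 + 32, so y * 800 == (y << 9) + (y << 8) + (y << 5).
inline std::size_t shiftIndex(std::size_t x, std::size_t y)
{
    return (y << 9) + (y << 8) + (y << 5) + x;
}

// Straightforward form, for comparison.
inline std::size_t mulIndex(std::size_t x, std::size_t y)
{
    return y * 800 + x;
}

// Checks that both forms agree for every pixel of an 800x800 buffer.
void checkAllIndices()
{
    for (std::size_t y = 0; y < 800; ++y)
        for (std::size_t x = 0; x < 800; ++x)
            assert(shiftIndex(x, y) == mulIndex(x, y));
}
```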