OpenACC and GNU Scientific Library - data movement of gsl_matrix

I've watched the recorded OpenACC overview course videos up to lecture 3, which talks about expressing data movement. How would you move a gsl_matrix* from the CPU to the GPU using copyin()? For example, on the CPU I can do something like
gsl_matrix *Z = gsl_matrix_calloc(100, 100);
which gives me a 100x100 matrix of zeroes. Now Z is a pointer to a gsl_matrix structure, which looks like:
typedef struct {
    size_t size1;
    size_t size2;
    size_t tda;
    double * data;
    gsl_block * block;
    int owner;
} gsl_matrix;
How would I express data movement of Z (which is a pointer) from the CPU to the GPU using copyin()?

I can't speak directly about using GSL within OpenACC data and compute regions, but I can give you a general answer about aggregate types with dynamic data members.
The first thing to try, assuming you're using the PGI compilers and a newer NVIDIA device, is CUDA Unified Memory (UVM). Compiling with the flag "-ta=tesla:managed", all dynamically allocated data will be managed by the CUDA runtime, so you don't need to manage the data movement yourself. There are overheads and caveats involved, but it makes things easier when getting started. Note that CUDA 8.0, which ships with PGI 16.9 or later, improves UVM performance.
Without UVM, you need to perform a manual deep copy of the data. Below is the basic idea: first create the parent structure on the device via a shallow copy. Next, create the dynamic array "data" on the device, copy over the array's initial values, and attach the device pointer for "data" to the device structure's data pointer. Since "block" is itself an array of structs with dynamic data members, you'll need to loop through it, creating its data arrays on the device.
gsl_matrix *mat = (gsl_matrix*) malloc(sizeof(gsl_matrix));
#pragma acc enter data copyin(mat[0:1])
// Change these to the correct sizes of "data" and "block"
#pragma acc enter data copyin(mat->data[0:dataSize])
#pragma acc enter data copyin(mat->block[0:blockSize])
for (i = 0; i < blockSize; ++i) {
    #pragma acc enter data copyin(mat->block[i].data[0:mat->block[i].size])
}
To delete, walk the structure again, deleting from the bottom up:
for (i = 0; i < blockSize; ++i) {
    #pragma acc exit data delete(mat->block[i].data)
}
#pragma acc exit data delete(mat->block)
#pragma acc exit data delete(mat->data)
#pragma acc exit data delete(mat)
When you update, be sure to only update scalars or arrays of fundamental data types, i.e., update "data" but not "block". Update does a shallow copy, so updating "block" would overwrite host or device pointers, leading to illegal addresses.
Finally, be sure to put the matrix variable in a "present" clause when using it in a compute region.
#pragma acc parallel loop present(mat)
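Putting the pieces together for the gsl_matrix case, here is a minimal sketch. It uses simplified stand-ins for the GSL types (the real definitions live in gsl_matrix.h, and a real gsl_matrix owns a single gsl_block rather than an array of them), and the pragmas follow the enter/exit data pattern above:

```cpp
#include <cstdlib>
#include <cstddef>

// Simplified stand-ins for the GSL types (illustrative, not the real headers).
typedef struct { size_t size; double *data; } gsl_block_like;
typedef struct {
    size_t size1, size2, tda;
    double *data;
    gsl_block_like *block;
    int owner;
} gsl_matrix_like;

void copy_matrix_to_device(gsl_matrix_like *m) {
    size_t dataSize = m->size1 * m->tda;  // elements addressable through data
    // 1. Shallow-copy the parent struct to the device.
    #pragma acc enter data copyin(m[0:1])
    // 2. Create and fill the dynamic array; the runtime attaches the
    //    device pointer to the device copy's data member.
    #pragma acc enter data copyin(m->data[0:dataSize])
    // 3. The matrix owns one block; copy the struct, then its data array.
    #pragma acc enter data copyin(m->block[0:1])
    #pragma acc enter data copyin(m->block->data[0:m->block->size])
}

void delete_matrix_from_device(gsl_matrix_like *m) {
    // Delete bottom-up: children before parents.
    #pragma acc exit data delete(m->block->data)
    #pragma acc exit data delete(m->block)
    #pragma acc exit data delete(m->data)
    #pragma acc exit data delete(m[0:1])
}
```

Without an OpenACC compiler the pragmas are simply ignored, so the sketch also shows the host-side logic unchanged.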

Related

OpenCL built-in function select

I am trying to use the select function to choose elements from v1 and v2 based on my 3rd argument; however, I do not know a way to access the current components in v1 and v2.
Say if v[i] is more than 5, I want to select v2[i] into results[i], else select v1[i], but I can't access the components that way.
Any advice would be appreciated! I am a super beginner at this, btw.
__kernel void copy(__global int4* Array1,
__global int* Array2,
__global int* output
)
{
int id = get_local_id(0);
//Reads the contents from array 1 and 2 into local memory
__local int4 local_array1;
__local int local_array2;
local_array1 = Array1[id];
local_array2 = Array2[id];
//Copy the contents of array 1 into an int8 vector called v
int8 v;
/*i have trouble here too, how do i copy into int8 v from int4 data type */
v = vload8(0, Array1);
//Copy the contents of array 2 into two int8 vectors called v1 and v2
int8 v1, v2;
v1 = vload8(0, Array2);
v2 = vload8(1, Array2);
//Creates an int8 vector in private memory called results
int8 results;
if (any(v > 5) == 1) {
results = select(v2[what do i do to get current index], v1[i], isgreater(v[i], 5.0));
vstore8(results, 0, output);
}
else {
results.lo = v1.lo;
results.hi = v2.lo;
vstore8 (results, 0, output);
}
}
You're trying to access vector elements via the [] operator. That's illegal in OpenCL. It might work with some OpenCL compilers, but it's still undefined behaviour.
The "official" way to access vector elements is 1) vector.x (likewise .y, .z, .w) or 2) vector.sN (e.g. .s0 through .s7 for an int8).
As you noticed, this does not allow dynamic access.
The reason is: a vector is not an array. A vector is supposed to map to a hardware "vector register" (or multiple registers). E.g. a "float8" will map to a single 256bit AVX register on AVX2 CPU, or two 128bit AVX registers on AVX1 CPU.
OpenCL does not have an operator to dynamically access vector elements. Perhaps a missing feature, but it reflects the reality of vectorized hardware: most instructions only operate on entire hardware vector registers, not on their individual elements. If you want to work on a dynamically selected vector element, you have to extract it from the vector first, e.g. by storing the vector to a private array and indexing that array.
Using vectors makes sense in some specific cases; IMO they're useful mainly for two things: 1) when you have a bunch of values logically tied together (e.g. the colors in a pixel) and 99.99% of the time you don't need to access individual values; 2) when you have hardware with vector registers (e.g. a VLIW CPU or GPU) and your OpenCL compiler can't "autovectorize" the code, so you need to manually write vectorized code to get reasonable performance.
In your code, I'd simply change __global int4* Array1 to __global int* Array1, write the kernel without using vectors (simply indexing it as a normal array), and see how it performs. If you're targeting modern NVIDIA/AMD GPUs, you don't need vectors at all to get good performance.
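For reference, the per-element logic the questioner is after is just a ternary select per index; in a scalar OpenCL kernel each work-item would compute one i = get_global_id(0). Sketched here in plain C++ (the function name is illustrative):

```cpp
#include <vector>
#include <cstddef>

// Scalar equivalent of the intended select(): for each element, take v2[i]
// when v[i] > 5, otherwise v1[i]. No vector types needed.
std::vector<int> select_gt5(const std::vector<int>& v,
                            const std::vector<int>& v1,
                            const std::vector<int>& v2) {
    std::vector<int> out(v.size());
    for (size_t i = 0; i < v.size(); ++i)
        out[i] = (v[i] > 5) ? v2[i] : v1[i];
    return out;
}
```

The OpenCL kernel body would be the single line `output[i] = (v[i] > 5) ? v2[i] : v1[i];` with i taken from get_global_id(0).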

multi dimensional array allocation in chunks

I was poking around with multidimensional arrays today, and I came across a blog post which distinguishes between rectangular arrays and jagged arrays; usually I would do this for both jagged and rectangular:
Object** obj = new Object*[5];
for (int i = 0; i < 5; ++i)
{
    obj[i] = new Object[10];
}
but in that blog it was said that if I knew that the 2D array was rectangular, then I'm better off allocating the entire thing as a 1D array and using an improvised way of accessing the elements, something like this:
Object* obj = new Object[rows * cols];
obj[x * cols + y];
//which would have been obj[x][y] in the previous implementation
I somehow have a clue that allocating a contiguous memory chunk would be good, but I don't really understand how big of a difference this would make. Can somebody explain?
First and less important, when you allocate and free your object you only need to do a single allocation/deallocation.
More important: when you use the array you basically get to trade a multiplication against a memory access. On modern computers, memory access is much much much slower than arithmetic.
That's a bit of a lie, because much of the slowness of memory accesses gets hidden by caches -- regions of memory that are being accessed frequently get stored in fast memory inside, or very near to, the CPU and can be accessed faster. But these caches are of limited size, so (1) if your array isn't being used all the time then the row pointers may not be in the cache and (2) if it is being used all the time then they may be taking up space that could otherwise be used by something else.
Exactly how it works out will depend on the details of your code, though. In many cases it will make no discernible difference one way or the other to the speed of your program. You could try it both ways and benchmark.
[EDITED to add, after being reminded of it by Peter Schneider's comment:] Also, if you allocate each row separately they may end up all being in different parts of memory, which may make your caches a bit less effective -- data gets pulled into cache in chunks, and if you often go from the end of one row to the start of the next then you'll benefit from that. But this is a subtle one; in some cases having your rows equally spaced in memory may actually make the cache perform worse, and if you allocate several rows in succession they may well end up (almost) next to one another in memory anyway, and in any case it probably doesn't matter much unless your rows are quite short.
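A minimal sketch of the two layouts side by side (sizes are arbitrary), which you could wrap in timing code to benchmark on your own data as suggested above:

```cpp
#include <cstddef>

const int ROWS = 5, COLS = 10;

// Jagged: one allocation per row, plus an array of row pointers.
// Each access obj[x][y] loads the row pointer, then the element.
int** make_jagged() {
    int** obj = new int*[ROWS];
    for (int i = 0; i < ROWS; ++i)
        obj[i] = new int[COLS]();   // zero-initialized row
    return obj;
}

// Flat: one contiguous allocation; the index is computed arithmetically,
// trading a multiplication for the extra memory access.
int* make_flat() {
    return new int[ROWS * COLS]();  // zero-initialized block
}

inline int& at(int* flat, int x, int y) { return flat[x * COLS + y]; }
```

Both give the same logical grid; only the memory layout and access cost differ.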
Allocating a 2D array as one big chunk permits the compiler to generate more efficient code than allocating it in multiple chunks: the multiple-chunk approach costs at least one extra pointer dereference per access. BTW, declaring the 2D array like this:
Object obj[rows][cols];
obj[x][y];
is equivalent to:
Object* obj = new Object[rows * cols];
obj[x * cols + y];
in terms of speed. But the first one is not dynamic (you need to specify the values of "rows" and "cols" at compile time).
By having one large contiguous chunk of memory, you may get improved performance because there is more chance that memory accesses are already in the cache. This idea is called cache locality. We say the large array has better cache locality. Modern processors have several levels of cache. The lowest level is generally the smallest and the fastest.
It still pays to access the array in meaningful ways. For example, if data is stored in row-major order and you access it in column-major order, you are scattering your memory accesses. At certain sizes, this access pattern will negate the advantages of caching.
Having good cache performance is far preferable to any concerns you may have about multiplying values for indexing.
If one of the dimensions of your array is a compile time constant you can allocate a "truly 2-dimensional array" in one chunk dynamically as well and then index it the usual way. Like all dynamic allocations of arrays, new returns a pointer to the element type. In this case of a 2-dimensional array the elements are in turn arrays -- 1-dimensional arrays. The syntax of the resulting element pointer is a bit cumbersome, mostly because the dereferencing operator*() has a lower precedence than the indexing operator[](). One possible allocation statement could be int (*arr7x11)[11] = new int[7][11];.
Below is a complete example. As you see, the innermost index in the allocation can be a run-time value; it determines the number of elements in the allocated array. The other indices determine the element type (and hence element size as well as overall size) of the dynamically allocated array, which of course must be known to perform the allocation. As discussed above, the elements are themselves arrays, here 1-dimensional arrays of 11 ints.
#include <cstdio>
using namespace std;
int main(int argc, char **argv)
{
constexpr int cols = 11;
int rows = 7;
// overwrite with cmd line arg if present.
// if scanf fails, default is retained.
if(argc >= 2) { sscanf(argv[1], "%d", &rows); }
// The actual allocation of "rows" elements of
// type "array of 'cols' ints". Note the brackets
// around *arr7x11 in order to force operator
// evaluation order. arr7x11 is a pointer to array,
// not an array of pointers.
int (*arr7x11)[cols] = new int[rows][cols];
for(int row = 0; row<rows; row++)
{
for(int col = 0; col<cols; col++)
{
arr7x11[row][col] = (row+1)*1000 + col+1;
}
}
for(int row = 0; row<rows; row++)
{
for(int col = 0; col<cols; col++)
{
printf("%6d", arr7x11[row][col]);
}
putchar('\n');
}
return 0;
}
A sample session:
g++ -std=c++14 -Wall -o 2darrdecl 2darrdecl.cpp && ./2darrdecl 3
1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011

Prealloc memory list

We are trying to develop a realtime application in which 4 cameras each send their image array to a method 100 times a second.
In this method I have to make a copy of each array (used for image processing in another thread).
I would like to store the last 100 images of each camera in a list.
The problem is: how do I preallocate such memory in a list (new them in the constructor?).
I would like to use something like a ring buffer with a fixed size, preallocated image arrays, and the FIFO principle.
Any idea how?
Edit1:
Example pseudo code:
// called from writer thread
void receiveImage(const char *data, int length)
{
Image *image = images.nextStorage();
std::copy(data, data + length, image->data);
}
// prealloc
void preallocImages()
{
for (int i = 0; i < 100; i++)
images.preAlloc(new Image(400, 400));
}
// consumer thread
void imageProcessing()
{
Image image = images.waitAndGetImage();
// ... todo
}
Say you create an Image class to hold the data for an image, having a ring buffer amounts to something like:
std::vector<Image> images(100);
int next = 0;
...
while (whatever)
{
images[next++] = get_image();
next %= images.size();
}
You talk about preallocating memory: each Image constructor can own the task of preallocating memory for its own image. It could do that with new, or, if you have fixed-size images that aren't particularly huge, you could try a correspondingly sized array in the Image class... that way all the image data will be kept contiguously in memory - it might be a little faster to iterate the images "in order". Note that simply having allocated virtual addresses doesn't mean there's physical backing memory yet, and that stuff may still be swapped out into virtual memory. If you have memory access speed issues, you might want to think about scanning over the memory of an image you expect to use shortly before using it, or using OS functions to advise the OS of your intended memory use patterns. Might as well get something working and profile it first ;-).
For FIFO handling, just have another variable, also starting at 0; while it's != next you can "process" the image at that index in the vector, then increment the variable until it catches up with next.
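A minimal single-producer sketch of that ring-buffer idea (Image is reduced to a stub here, and a real version would need synchronization between the writer and consumer threads, e.g. a mutex or atomics on the indices):

```cpp
#include <vector>
#include <cstddef>

struct Image {
    std::vector<char> data;
    explicit Image(size_t n = 0) : data(n) {}  // preallocates the pixel buffer
};

class ImageRing {
public:
    // All images are allocated up front; nothing is newed per frame.
    ImageRing(size_t count, size_t imageSize)
        : images_(count, Image(imageSize)) {}

    // Writer side: returns the slot to overwrite next (the oldest image).
    Image& nextStorage() {
        Image& slot = images_[next_];
        next_ = (next_ + 1) % images_.size();
        if (count_ < images_.size()) ++count_;
        return slot;
    }

    size_t size() const { return count_; }

private:
    std::vector<Image> images_;
    size_t next_ = 0;   // next slot to write
    size_t count_ = 0;  // how many slots hold valid data
};
```

The writer thread copies each incoming frame into nextStorage().data; the consumer keeps its own index and processes slots until it catches up with the writer's.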

What's the proper way to declare and initialize a (large) two dimensional object array in c++?

I need to create a large two dimensional array of objects. I've read some related questions on this site and others regarding multi_array, matrix, vector, etc, but haven't been able to put it together. If you recommend using one of those, please go ahead and translate the code below.
Some considerations:
The array is somewhat large (1300 x 1372).
I might be working with more than one of these at a time.
I'll have to pass it to a function at some point.
Speed is a large factor.
The two approaches that I thought of were:
Pixel pixelArray[1300][1372];
for(int i=0; i<1300; i++) {
for(int j=0; j<1372; j++) {
pixelArray[i][j].setOn(true);
...
}
}
and
Pixel* pixelArray[1300][1372];
for(int i=0; i<1300; i++) {
for(int j=0; j<1372; j++) {
pixelArray[i][j] = new Pixel();
pixelArray[i][j]->setOn(true);
...
}
}
What's the right approach/syntax here?
Edit:
Several answers have assumed Pixel is small - I left out details about Pixel for convenience, but it's not small/trivial. It has ~20 data members and ~16 member functions.
Your first approach allocates everything on the stack, which is otherwise fine but leads to a stack overflow when you try to allocate too much. The limit is usually around 8 megabytes on modern OSes, so allocating an array of 1300 * 1372 elements on the stack is not an option.
Your second approach allocates 1300 * 1372 elements on the heap individually, which is a tremendous load for the allocator, which maintains multiple linked lists of chunks of allocated and free memory. Also a bad idea, especially since Pixel seems to be rather small.
What I would do is this:
Pixel* pixelArray = new Pixel[1300 * 1372];
for(int i=0; i<1300; i++) {
for(int j=0; j<1372; j++) {
pixelArray[i * 1372 + j].setOn(true);
...
}
}
This way you allocate one large chunk of memory on heap. Stack is happy and so is the heap allocator.
If you want to pass it to a function, I'd vote against using simple arrays. Consider:
void doWork(Pixel array[][]);
This does not contain any size information (and in fact won't even compile: all but the first dimension must be specified). You could pass the size info via separate arguments, but I'd rather use something like std::vector<Pixel>. Of course, this requires that you define an addressing convention (row-major or column-major).
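For example, a hypothetical function that carries the dimensions alongside a flat vector, assuming the row-major convention (element (x, y) at index x * cols + y):

```cpp
#include <vector>
#include <cstddef>

// Illustrative only: sums the half-open region [x0, x1) x [y0, y1) of a
// pixel grid stored row-major in a flat vector of cols columns.
int sumRegion(const std::vector<int>& pixels, size_t cols,
              size_t x0, size_t y0, size_t x1, size_t y1) {
    int sum = 0;
    for (size_t x = x0; x < x1; ++x)
        for (size_t y = y0; y < y1; ++y)
            sum += pixels[x * cols + y];
    return sum;
}
```

The vector knows its total size, so only the number of columns needs to travel with it.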
An alternative is std::vector<std::vector<Pixel> >, where each level of vectors is one array dimension. Advantage: The double subscript like in pixelArray[x][y] works, but the creation of such a structure is tedious, copying is more expensive because it happens per contained vector instance instead of with a simple memcpy, and the vectors contained in the top-level vector must not necessarily have the same size.
These are basically your options using the Standard Library. The right solution would be something like std::vector with two dimensions. Numerical libraries and image manipulation libraries come to mind, but matrix and image classes are most likely limited to primitive data types in their elements.
EDIT: I forgot to make it clear that everything above is just a set of arguments; in the end, your personal taste and the context will have to be taken into account. If you're on your own in the project, a vector plus a defined and documented addressing convention should be good enough. But if you're in a team, and it's likely that someone will disregard the documented convention, the cascaded vector-in-vector structure is probably better, because the tedious parts can be implemented in helper functions.
I'm not sure how complicated your Pixel data type is, but maybe something like this will work for you?:
std::fill(array, array+100, 42); // sets every value in the array to 42
Reference:
Initialization of a normal array with one default value
Check out Boost's Generic Image Library.
gray8_image_t pixelArray;
pixelArray.recreate(1300,1372);
for(gray8_image_t::iterator pIt = pixelArray.begin(); pIt != pixelArray.end(); pIt++) {
*pIt = 1;
}
My personal preference would be to use std::vector:
typedef std::vector<Pixel> PixelRow;
typedef std::vector<PixelRow> PixelMatrix;
PixelMatrix pixelArray(1300, PixelRow(1372, Pixel(true)));
// ^^^^ ^^^^ ^^^^^^^^^^^
// Size 1 Size 2 default Value
While I wouldn't necessarily make this a struct, this demonstrates how I would approach storing and accessing the data. If Pixel is rather large, you may want to use a std::deque instead.
struct Pixel2D {
Pixel2D (size_t rsz_, size_t csz_) : data(rsz_*csz_), rsz(rsz_), csz(csz_) {
for (size_t r = 0; r < rsz; r++)
for (size_t c = 0; c < csz; c++)
at(r, c).setOn(true);
}
Pixel &at(size_t row, size_t col) {return data.at(row*csz+col);}
std::vector<Pixel> data;
size_t rsz;
size_t csz;
};

How to perform deep copying of struct with CUDA? [duplicate]

This question already has answers here:
Copying a struct containing pointers to CUDA device
(3 answers)
Closed 5 years ago.
Programming with CUDA I am facing a problem trying to copy some data from host to gpu.
I have 3 nested struct like these:
typedef struct {
char data[128];
short length;
} Cell;
typedef struct {
Cell* elements;
int height;
int width;
} Matrix;
typedef struct {
Matrix* tables;
int count;
} Container;
So Container "includes" some Matrix elements, which in turn includes some Cell elements.
Let's suppose I dynamically allocate the host memory in this way:
Container c;
c.tables = (Matrix*) malloc(20 * sizeof(Matrix));
for (int i = 0; i < 20; i++) {
    Matrix m;
    m.elements = (Cell*) malloc(100 * sizeof(Cell));
    c.tables[i] = m;
}
That is, a Container of 20 Matrix of 100 Cells each.
How could I now copy this data to device memory using cudaMemcpy()?
Is there any good way to perform a deep copy of "struct of struct" from host to device?
Thanks for your time.
Andrea
The short answer is "just don't". There are four reasons why I say that:
There is no deep copy functionality in the API
The resulting code you will have to write to set up and copy the structure you have described to the GPU will be ridiculously complex (about 4000 API calls at a minimum, and probably an intermediate kernel, for your 20 Matrix of 100 Cells example)
The GPU code using three levels of pointer indirection will have massively increased memory access latency and will break what little cache coherency is available on the GPU
If you want to copy the data back to the host afterwards, you have the same problem in reverse
Consider using linear memory and indexing instead. It is portable between host and GPU, and the allocation and copy overhead is about 1% of that of the pointer-based alternative.
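To make that concrete, here is a sketch of what the linear alternative can look like on the host side (names are illustrative, not CUDA API). The entire payload becomes one contiguous array, so the transfer reduces to a single cudaMemcpy of cells.data() plus a copy of the two integer dimensions:

```cpp
#include <vector>
#include <cstddef>

// 128-byte payload per cell, as in the question.
struct Cell { char data[128]; short length; };

// Flattened container: one contiguous array instead of three levels of
// pointers. Cell i of table t lives at index t * cellsPerTable + i, and
// the same indexing arithmetic works unchanged inside a kernel.
struct FlatContainer {
    int tableCount;
    int cellsPerTable;
    std::vector<Cell> cells;

    FlatContainer(int tables, int perTable)
        : tableCount(tables), cellsPerTable(perTable),
          cells(static_cast<size_t>(tables) * perTable) {}

    Cell& at(int table, int cell) {
        return cells[static_cast<size_t>(table) * cellsPerTable + cell];
    }
};
```

For the 20-tables-of-100-cells example, one cudaMalloc and one cudaMemcpy of tableCount * cellsPerTable * sizeof(Cell) bytes replaces the entire nested-pointer dance.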
If you really want to do this, leave a comment and I will try and dig up some old code examples which show what a complete folly nested pointers are on the GPU.