How and when should I use pitched pointer with the cuda API? - c++

I have quite a good understanding about how to allocate and copy linear memory with cudaMalloc() and cudaMemcpy(). However, when I want to use the CUDA functions to allocate and copy 2D or 3D matrices, I am often befuddled by the various arguments, especially concerning pitched pointers which are always present when dealing with 2D/3D arrays. The documentation is good for providing a couple examples on how to use them but it assumes that I am familiar with the notion of padding and pitch, which I am not.
I usually end up tweaking the various examples I find in the documentation or somewhere else on the web, but the blind debugging that follows is quite painful, so my question is:
What is a pitch? How do I use it? How do I allocate and copy 2D and 3D arrays in CUDA?

Here is an explanation about pitched pointer and padding in CUDA.
Linear memory vs padded memory
First, let's start with the reason non-linear memory exists. When allocating memory with cudaMalloc, the result is like an allocation with malloc: we get a contiguous memory chunk of the size specified and we can put anything we want in it. If we want to allocate a vector of 10000 floats, we simply do:
float* myVector;
cudaMalloc(&myVector, 10000*sizeof(float));
and then access the i-th element of myVector by classic indexing:
float element = myVector[i];
and if we want to access the next element, we just do:
float next_element = myVector[i+1];
This works well because accessing the element right next to the previous one is (for reasons I am not aware of and I don't wish to be for now) cheap.
Things become a little bit different when we use our memory as a 2D array. Let's say our 10000 float vector is in fact a 100x100 array. We can allocate it by using the same cudaMalloc function, and if we want to read the i-th row, we do:
float* myArray;
cudaMalloc(&myArray, 10000*sizeof(float));
float row[100]; // one row: 100 columns
for (int j=0; j<100; ++j)
row[j] = myArray[i*100+j];
Word alignment
So we have to read memory from myArray+100*i to myArray+100*(i+1)-1. The number of memory access operations this takes depends on the number of memory words the row spans. The number of bytes in a memory word depends on the implementation. To minimize the number of memory accesses when reading a single row, we must ensure that the row starts at the start of a word, hence we must pad every row so that the next one begins on a word boundary.
Bank conflicts
Another reason for padding arrays is the bank mechanism in CUDA, which concerns shared memory access. When the array is in shared memory, it is split across several memory banks. Two CUDA threads can access it simultaneously, provided they don't access memory belonging to the same memory bank. Since we usually want to treat each row in parallel, we can ensure that rows can be accessed simultaneously by padding each row to the start of a new bank. (Note that cudaMallocPitch, used below, allocates global device memory, so for it the alignment argument above is the relevant one.)
Now, instead of allocating the 2D array with cudaMalloc, we will use cudaMallocPitch:
size_t pitch;
float* myArray;
cudaMallocPitch(&myArray, &pitch, 100*sizeof(float), 100); // width in bytes by height
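Note that the pitch here is an output of the function: it is written through the second argument, while the function's actual return value is an error code. cudaMallocPitch checks what the pitch should be on your system and hands back the appropriate value. What cudaMallocPitch does is the following: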
Allocate the first row.
Check whether the number of bytes allocated makes it correctly aligned, for example whether it is a multiple of 128.
If not, allocate further bytes to reach the next multiple of 128. The pitch is then the number of bytes allocated for a single row, including the extra bytes (padding bytes).
Reiterate for each row.
At the end, we have typically allocated more memory than necessary because each row is now the size of pitch, and not the size of w*sizeof(float).
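The rounding step in the list above can be sketched in plain C++. This is only an illustration of the arithmetic: padded_pitch is a hypothetical helper, and the alignment of 128 is the example value from the text, not something you should hard-code, since cudaMallocPitch chooses the real value for your device.

```cpp
#include <cassert>
#include <cstddef>

// Round width_bytes up to the next multiple of alignment.
// Hypothetical helper: the alignment is an assumed, illustrative value;
// the real pitch is whatever cudaMallocPitch reports.
std::size_t padded_pitch(std::size_t width_bytes, std::size_t alignment) {
    assert(alignment > 0);
    return (width_bytes + alignment - 1) / alignment * alignment;
}
```

With an assumed alignment of 128, a row of 400 useful bytes would occupy 512 bytes, so 112 bytes per row are padding.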
But now, when we want to access an element in a column, we must do:
float* row_start = (float*)((char*)myArray + row * pitch);
float column_element = row_start[column];
The offset in bytes between two successive rows can no longer be deduced from the size of our array; that is why we want to keep the pitch returned by cudaMallocPitch. And since pitch is a multiple of the padding size (typically, the biggest of word size and bank size), it works great. Yay.
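To see the addressing rule without a GPU, here is a host-only sketch that stores rows pitch bytes apart in an ordinary byte buffer. The names element_offset, store and load are made up for this illustration, and the pitch you pass in is an assumption standing in for the value cudaMallocPitch would give you.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Byte offset of element (row, column) when rows are pitch bytes apart.
std::size_t element_offset(std::size_t pitch, std::size_t row, std::size_t column) {
    return row * pitch + column * sizeof(float);
}

// Write one float into a host buffer laid out like a pitched allocation.
void store(std::vector<unsigned char>& buf, std::size_t pitch,
           std::size_t row, std::size_t column, float value) {
    const std::size_t off = element_offset(pitch, row, column);
    assert(off + sizeof(float) <= buf.size());
    std::memcpy(buf.data() + off, &value, sizeof(float));
}

// Read one float back from the same layout.
float load(const std::vector<unsigned char>& buf, std::size_t pitch,
           std::size_t row, std::size_t column) {
    const std::size_t off = element_offset(pitch, row, column);
    assert(off + sizeof(float) <= buf.size());
    float value;
    std::memcpy(&value, buf.data() + off, sizeof(float));
    return value;
}
```

The point is that the row stride is pitch, not 100*sizeof(float): with a pitch of 512, row 3 starts at byte 1536 rather than byte 1200.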
Copying data to/from pitched memory
Now that we know how to create an array with cudaMallocPitch and access a single element in it, we might want to copy all or part of it to and from other memory, linear or not.
Let's say we want to copy our array into a 100x100 array allocated on our host with malloc:
float* host_memory = (float*)malloc(100*100*sizeof(float));
If we use cudaMemcpy, we will copy all the memory allocated with cudaMallocPitch, including the padding bytes between rows. To avoid copying the padding, we must copy each row one by one. We can do it manually:
for (size_t i=0; i<100; ++i) {
    // pitch is a byte count, so the device pointer is advanced through a char*
    cudaMemcpy(host_memory + i*100, (char*)myArray + pitch*i,
               100*sizeof(float), cudaMemcpyDeviceToHost);
}
Or we can tell the CUDA API that we want only the useful memory from the memory we allocated with padding bytes for its convenience so if it could deal with its own mess automatically it would be very nice indeed, thank you. And here enters cudaMemcpy2D:
cudaMemcpy2D(host_memory, 100*sizeof(float)/*no pitch on host*/,
myArray, pitch/*CUDA pitch*/,
100*sizeof(float)/*width in bytes*/, 100/*height*/,
cudaMemcpyDeviceToHost);
Now the copy will be done automatically. It will copy the number of bytes specified in width (here: 100*sizeof(float)), height times (here: 100), advancing by pitch bytes every time it jumps to the next row. Note that we must still provide the pitch for the destination memory because it could be padded, too. Here it is not, so the pitch is equal to the pitch of a non-padded array: it is the size of a row. Note also that the width parameter in the memcpy function is expressed in bytes, but the height parameter is expressed in number of rows. That is because of the way the copy is done, somewhat like the manual copy I wrote above: the width is the size of each copy along a row (elements that are contiguous in memory) and the height is the number of times this operation must be performed. (These inconsistencies in units, as a physicist, annoy me very much.)
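The row-by-row behaviour described above can be mimicked on the host in a few lines. This is a hypothetical stand-in named copy_2d, written only to make the roles of the two pitches, the width in bytes and the height in rows concrete; it is not how cudaMemcpy2D is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Host-only analogue of a pitched 2D copy (hypothetical helper):
// copies `height` rows of `width_bytes` each, advancing the source
// and the destination by their own pitch after every row.
void copy_2d(void* dst, std::size_t dpitch,
             const void* src, std::size_t spitch,
             std::size_t width_bytes, std::size_t height) {
    assert(width_bytes <= dpitch && width_bytes <= spitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dpitch, s + row * spitch, width_bytes);
    }
}
```

Copying from a padded source into a tightly packed destination then simply means passing a destination pitch equal to the width in bytes.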
Dealing with 3D arrays
3D arrays are actually no different from 2D arrays: there is no additional padding. A 3D array is just a classical 2D array of padded rows. That is why, when allocating a 3D array, you only get one pitch, which is the difference in bytes between two successive rows. If you want to step between successive points along the depth dimension, you can safely multiply the pitch by the number of rows in a slice, which gives you the slicePitch.
The CUDA API for accessing 3D memory is slightly different from the one for 2D memory, but the idea is the same:
When using cudaMalloc3D, you receive a pitch value that you must carefully keep for subsequent access to the memory.
When copying a 3D memory chunk, you cannot use cudaMemcpy unless you are copying a single row. You must use one of the other copy functions provided by the CUDA API that take the pitch into account.
When you copy your data to/from linear memory, you must provide a pitch to your pointer even though it is irrelevant: this pitch is the size of a row, expressed in bytes.
The size parameters are expressed in bytes for the row size, and in number of elements for the column and depth dimension.
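The 3D addressing implied by these points can be written out as a small formula. The sketch below assumes the layout described above (rows padded to pitch bytes, slices stacked without extra padding); offset_3d is a hypothetical helper, not a CUDA function.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of element (x, y, z) in a pitched 3D allocation.
// pitch:     bytes per padded row
// height:    number of rows per slice
// elem_size: size of one element in bytes
std::size_t offset_3d(std::size_t pitch, std::size_t height, std::size_t elem_size,
                      std::size_t x, std::size_t y, std::size_t z) {
    assert(y < height);
    const std::size_t slice_pitch = pitch * height;  // bytes per 2D slice
    return z * slice_pitch + y * pitch + x * elem_size;
}
```

For example, with a pitch of 512 bytes and 100 rows per slice, moving one step along the depth dimension skips 51200 bytes.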


What's "pitch" in cudaMemcpy2DToArray and cudaMemcpy2DFromArray

I'm converting the deprecated cudaMemcpyToArray and cudaMemcpyFromArray into cudaMemcpy2DToArray and cudaMemcpy2DFromArray. Rather than size of the deprecated calls, the new API calls for width, height, and pitch. The descriptions of spitch and dpitch are correspondingly "Pitch of source memory" and "Pitch of destination memory". I wonder what are those values: size of data items, something else?
More specifically, if I were to copy W*H floats, should I have pitch=sizeof(float), width=W, height=H, or pitch=sizeof(float)*W, width=sizeof(float)*W, height=H, or something else?
It should be:
pitch = sizeof(float)*W
width = sizeof(float)*W
height = H
The above is for cudaMemcpy2DToArray, and assumes you are transferring from host to device, which would most likely involve an unpitched allocation in host memory as the source.
The pitch of a pitched allocation is the size in bytes of one line of a 2D allocation, including padding bytes at the end of the line. It is the value returned by cudaMallocPitch, for example. For unpitched allocations, it is still the width of the line, and it is given by W*sizeof(element) where the 2D allocation width is given by W elements each of size sizeof(element).
This question and the link it refers to may also be of interest.

Memory layout of 2D area

How is a 2D area layout in memory? Especially if its a staggered area. Given, to my understanding, that memory is contiguous going from Max down to 0, does the computer allocate each area in the area one after the other? If so, should one of the areas in the area need to be resized, does it shift all the other areas down as to make space for the newly sized area?
If specifics are needed:
linux x86
Revision: (thanks user4581301)
I'm referring to having a vector<vector<T>> where T is some defined type. I'm not talking template programming here unless that doesn't change anything.
The precise details of how std::vector is implemented will vary from compiler to compiler, but more than likely, a std::vector contains a size_t member that stores the length and a pointer to the storage. It allocates this storage using whatever allocator you specify in the template, but the default is to use new, which allocates them off the heap. You probably know this, but typically the heap is the area of RAM below the stack in memory, which grows from the bottom up as the stack grows from the top down, and which the runtime manages by tracking which blocks of it are free.
The storage managed by a std::vector is a contiguous array of objects, so a vector of twenty vectors of T would contain at least a size_t storing the value 20, and a pointer to an array of twenty structures each containing a length and a pointer. Each of those pointers would point to an array of T, stored contiguously in memory.
If you instead create a rectangular two-dimensional array, such as T table[ROWS][COLUMNS], or a std::array< std::array<T, COLUMNS>, ROWS >, you will instead get a single continuous block of T elements stored in row-major order, that is: all the elements of row 0, followed by all the elements of row 1, and so on.
If you know the dimensions of the matrix in advance, the rectangular array will be more efficient because you’ll only need to allocate one block of memory. This is faster because you’ll only need to call the allocator and the destructor one time, instead of once per row, and also because it will be in one place, not split up over many different locations, and therefore the single block is more likely to be in the processor’s cache.
vectors are thin wrappers around a dynamically allocated array of their elements. For a vector<vector<T>>, this means that the outer vector's internal array contains the inner vector structures, but the inner vectors allocate and manage their own internal arrays separately (the structure contains a pointer to the managed array).
Essentially, the 2D aspect is purely in the program logic; the elements of any given "row" are contiguous, but there is no specified spatial relationship between the rows.
True 2D arrays (where the underlying memory is allocated as a single block) only really happen with C-style arrays declared with 2D syntax (int foo[10][20];) and nested std::array types, or POD types following the same basic design.
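A small experiment can make both layouts visible. The sketch below uses made-up dimensions (ROWS, COLUMNS) and hypothetical helper names: the first function measures where an element of a true 2D array sits relative to the start of the block, and the second checks that growing one inner vector leaves another row's storage where it was, which speaks to the "does it shift all the other areas" part of the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t ROWS = 3;     // illustrative sizes only
constexpr std::size_t COLUMNS = 4;

// Byte offset of table[r][c] from the start of a true 2D array.
std::size_t byte_offset(const int (&table)[ROWS][COLUMNS],
                        std::size_t r, std::size_t c) {
    assert(r < ROWS && c < COLUMNS);
    const auto base = reinterpret_cast<std::uintptr_t>(&table[0][0]);
    const auto elem = reinterpret_cast<std::uintptr_t>(&table[r][c]);
    return static_cast<std::size_t>(elem - base);
}

// In a vector of vectors every row owns its own storage, so resizing
// one row cannot move the elements of a different row.
bool other_row_stays_put_when_one_row_grows() {
    std::vector<std::vector<int>> jagged = {{1, 2, 3}, {4}, {5, 6}};
    const int* row0_before = jagged[0].data();
    jagged[1].resize(1000);  // may reallocate row 1's storage only
    return jagged[0].data() == row0_before && jagged[0][2] == 3;
}
```

In the true 2D array the offset is always (r*COLUMNS + c)*sizeof(int), which is the row-major rule described above.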

C++ memcpy to end of an array

I'm trying to translate C++ into C# and I'm trying to understand, conceptually what the following piece of code does:
memcpy( pQuotes + iStart, pCurQuotes + iSrc, iNumQuotes * sizeof( Quotation ) );
pQuotes is declared: struct Quotation *pQuotes.
pCurQuotes is a CArray of struct Quotation, iSrc being its first index. iNumQuotes is the number of elements in pCurQuotes.
What I would like to know is, if iStart is to pQuotes' last index, would the size of pQuotes be increased to accommodate the number of elements in pCurQuotes? In other words, is this function resizing the array then appending to it?
If iStart is to pQuotes' last index, would the size of pQuotes be
increased to accommodate the number of elements in pCurQuotes? In
other words, is this function resizing the array then appending to it?
This is a fundamental limitation of these low-level memory functions. It is your responsibility as the developer to ensure that all buffers are big enough so that you never read or write outside the buffer.
Conceptually, what happens here is that the program will just copy the raw bytes from the source buffer into the destination buffer. It does not perform any bounds- or type-checking. For your problem of converting this to C# the second point is of particular importance, as there are no type conversions invoked during the copying.
would the size of pQuotes be increased to accommodate the number of elements in pCurQuotes?
No. The caller is expected to make sure before making a call to memcpy that pQuotes points to a block of memory sufficient to accommodate iStart+iNumQuotes elements of size sizeof(Quotation).
If you model this behavior with an array Quotation[] in C#, you need to extend the array to size at or above iStart+iNumQuotes elements.
If you are modeling it with List<Quotation>, you could call Add(...) in a loop, and let List<T> handle re-allocations for you.
No, memcpy does not do any resizing. pQuotes is merely a pointer to a space in memory, and the type of pointer determines its size for pointer arithmetic.
All that memcpy does is copy n bytes from a source to a destination. You need to apply some defensive programming techniques to ensure that you do not write beyond the size of your destination, because memcpy won't prevent it!
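As a rough illustration of that defensive step, here is a C++ sketch in which the destination is grown explicitly before the raw copy. The Quotation fields are invented for the example (the real struct is not shown in the question), and copy_quotes is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical layout: the real Quotation struct is not shown in the question.
struct Quotation {
    double price;
    int volume;
};
static_assert(std::is_trivially_copyable<Quotation>::value,
              "memcpy is only valid for trivially copyable types");

// Copies `count` elements from src into dst starting at index `start`.
// The destination is resized first, because memcpy itself never grows anything.
void copy_quotes(std::vector<Quotation>& dst, std::size_t start,
                 const Quotation* src, std::size_t count) {
    if (count == 0) return;
    assert(src != nullptr);
    if (dst.size() < start + count) dst.resize(start + count);
    std::memcpy(dst.data() + start, src, count * sizeof(Quotation));
}
```

A C# translation would need the equivalent of the resize line before copying into a Quotation[], or could sidestep it with a List<Quotation>.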

Efficient (time and space complexity) data structure for dense and sparse matrix

I have to read a file in which is stored a matrix with cars (1=BlueCar, 2=RedCar, 0=Empty).
I need to write an algorithm to move the cars of the matrix in that way:
blue ones move downward;
red ones move rightward;
there is a turn in which all the blue ones move and a turn to move all the red ones.
Before the file is read I don't know the matrix size and if it's dense or sparse, so I have to implement two data structures (one for dense and one for sparse) and two algorithms.
I need to reach the best time and space complexity possible.
Due to the unknown matrix size, I think to store the data on the heap.
If the matrix is dense, I think to use something like:
short int** M = new short int*[m];
short int* M_data = new short int[m*n];
for(int i=0; i< m; ++i)
M[i] = M_data + i * n;
With this structure I can allocate a contiguous space of memory and it is also simple to be accessed with M[i][j].
Now the problem is the structure to choose for the sparse case, and I have to consider also how I can move the cars through the algorithm in the simplest way: for example when I evaluate a car, I need to find easily if in the next position (downward or rightward) there is another car or if it's empty.
Initially I thought to define BlueCar and RedCar objects that inherits from the general Car object. In this objects I can save the matrix coordinates and then put them in:
std::vector<BlueCar> sparseBlue;
std::vector<RedCar> sparseRed;
Otherwise I can do something like:
vector< tuple< row, column, value >> sparseMatrix
But the problem of finding what's in the next position still remains.
Probably this is not the best way to do it, so how can I implement the sparse case in an efficient way? (also using a unique structure for sparse)
Why not simply create a memory mapping directly over the file? (assuming your data 0,1,2 is stored in contiguous bytes (or bits) in the file, and the position of those bytes also represents the coordinates of the cars)
This way you don't need to allocate extra memory and read in all the data, and the data can simply and efficiently be accessed with M[i][j].
Going over the rows would be L1-cache friendly.
In case of very sparse data, you could scan through the data once and keep a list of the empty regions/blocks in memory (only need to store startpos and size), which you could then skip (and adjust where needed) in further runs.
With memory mapping, only frequently accessed pages are kept in memory. This means that once you have scanned for the empty regions, memory will only be allocated for the frequently accessed non-empty regions (all this will be done automagically by the kernel - no need to keep track of it yourself).
Another benefit is that you are accessing the OS disk cache directly. Thus no need to keep copying and moving data between kernel space and user space.
To further optimize space- and memory usage, the cars could be stored in 2 bits in the file.
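The 2-bit storage idea could look roughly like the following in memory. PackedGrid is a hypothetical class for illustration; it assumes the 0/1/2 encoding from the question and packs four cells per byte in row-major order, which may or may not match the layout you choose for the file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Four 2-bit cells per byte: 0 = empty, 1 = blue, 2 = red.
class PackedGrid {
public:
    PackedGrid(std::size_t rows, std::size_t cols)
        : cols_(cols), bytes_((rows * cols + 3) / 4, 0) {}

    std::uint8_t get(std::size_t r, std::size_t c) const {
        const std::size_t cell = r * cols_ + c;
        return static_cast<std::uint8_t>((bytes_[cell / 4] >> shift(cell)) & 0x3u);
    }

    void set(std::size_t r, std::size_t c, std::uint8_t value) {
        assert(value <= 3);
        const std::size_t cell = r * cols_ + c;
        std::uint8_t& byte = bytes_[cell / 4];
        // Clear the two bits of this cell, then write the new value.
        byte = static_cast<std::uint8_t>(
            (byte & ~(0x3u << shift(cell))) | (static_cast<unsigned>(value) << shift(cell)));
    }

    std::size_t size_bytes() const { return bytes_.size(); }

private:
    static unsigned shift(std::size_t cell) {
        return static_cast<unsigned>(cell % 4) * 2;
    }
    std::size_t cols_;
    std::vector<std::uint8_t> bytes_;
};
```

Compared with one short int per cell this uses an eighth of the space, at the cost of a shift and a mask on every access.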
I'll have to move cars with openMP and MPI... Will the memory mapping
work also with concurrent threads?
You could certainly use multithreading, but not sure if openMP would be the best solution here, because if you work on different parts of the data at the same time, you may need to check some overlapping regions (i.e. a car could move from one block to another).
Or you could let the threads work on the middle parts of the blocks, and then start other threads to do the boundaries (with red cars that would be one byte, with blue cars a full row).
You would also need a locking mechanism for adjusting the list of the sparse regions. I think the best way would be to launch separate threads (depending on the size of the data of course).
In a somewhat similar task, I simply made use of Compressed Row Storage.
The Compressed Row and Column (in the next section) Storage formats
are the most general: they make absolutely no assumptions about the
sparsity structure of the matrix, and they don't store any unnecessary
elements. On the other hand, they are not very efficient, needing an
indirect addressing step for every single scalar operation in a
matrix-vector product or preconditioner solve.
You will need to be a bit more specific about time and space complexity requirements. CSR requires an extra indexing step for simple operations, but that is a minor amount of overhead if you're just doing simple matrix operations.
There's already an existing C++ implementation available online as well.
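For reference, a minimal Compressed Row Storage sketch might look like this. The names (CsrMatrix, to_csr, at) are made up for the example and it is not the implementation referred to above; it only shows the three arrays and the indirect lookup that the quoted passage mentions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct CsrMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> row_ptr;  // rows + 1 entries: where each row starts in values
    std::vector<std::size_t> col_idx;  // column of each stored value, ascending within a row
    std::vector<short> values;         // the nonzero entries (1 = blue, 2 = red)
};

// Build the CSR form from a dense matrix, keeping only nonzero cells.
CsrMatrix to_csr(const std::vector<std::vector<short>>& dense) {
    CsrMatrix m;
    m.rows = dense.size();
    m.cols = dense.empty() ? 0 : dense[0].size();
    m.row_ptr.push_back(0);
    for (const auto& row : dense) {
        for (std::size_t j = 0; j < row.size(); ++j) {
            if (row[j] != 0) {
                m.col_idx.push_back(j);
                m.values.push_back(row[j]);
            }
        }
        m.row_ptr.push_back(m.values.size());
    }
    return m;
}

// Look up cell (i, j): binary search among the stored columns of row i.
short at(const CsrMatrix& m, std::size_t i, std::size_t j) {
    assert(i < m.rows && j < m.cols);
    const auto first = m.col_idx.begin() + static_cast<std::ptrdiff_t>(m.row_ptr[i]);
    const auto last = m.col_idx.begin() + static_cast<std::ptrdiff_t>(m.row_ptr[i + 1]);
    const auto it = std::lower_bound(first, last, j);
    if (it == last || *it != j) return 0;  // not stored, so the cell is empty
    return m.values[static_cast<std::size_t>(it - m.col_idx.begin())];
}
```

Checking whether the cell to the right is free is a lookup within the same row, while checking the cell below needs a search in the next row; moving a car also means editing the arrays, so whether this beats a dense grid probably depends on how sparse the input really is.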

Data structure in C/C++ for multiple variable size arrays

This is the problem at hand:
I have several 10000s of arrays. Each array could be anywhere between 2-15 units in length.
The total length of all the elements in all the arrays and the number of arrays can be computed using some very low cost calculations. But the exact number in each array is not known until some fairly expensive computation is completed.
Since I know the total length of all the elements in all the arrays, I would like to just allocate data for it using just one new/malloc and just set pointers within this allocation. In my current implementation I use memmove to move the data after a certain item is inserted and updates all pointers accordingly.
Is there a better way of doing this?
It's not clear what you mean by better way. If you are looking for something that works faster and can afford some extra memory then you can keep two arrays, one with data, and the other one with the index of the array it belongs. After you added all the data, you can sort by the index and you have all your data split by arrays, finally you sweep the arrays and get the pointer to where each array belongs.
Regarding memory consumption, depending on how many arrays you have, and how big is your data, you can squeeze the index data to the last bits of your data, if you have it bounded by some number. This way, you only need to sort the numbers, and when you are sweeping retrieving the pointer where each array begins, you can clean the top bits.
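The sort-and-sweep idea from this answer could be sketched as follows. It assumes each item is recorded together with the id of the array it belongs to; group_by_array is a hypothetical name, and the bit-squeezing variant is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Tagged = std::pair<std::size_t, int>;  // (array id, value)

// Sorts items by array id (keeping insertion order within an array) and
// returns the start offset of every array; the last entry is items.size().
std::vector<std::size_t> group_by_array(std::vector<Tagged>& items,
                                        std::size_t array_count) {
    std::stable_sort(items.begin(), items.end(),
                     [](const Tagged& a, const Tagged& b) { return a.first < b.first; });
    std::vector<std::size_t> starts(array_count + 1, 0);
    for (const Tagged& item : items) {
        assert(item.first < array_count);
        ++starts[item.first + 1];          // count the items of each array
    }
    for (std::size_t k = 0; k < array_count; ++k) {
        starts[k + 1] += starts[k];        // prefix sum turns counts into offsets
    }
    return starts;
}
```

After the call, array k occupies items[starts[k]] up to, but not including, items[starts[k + 1]], so no memmove is needed while the data is being produced.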
Since I know the total length of all the elements in all the arrays, I would like to just allocate data for it using just one new/malloc and just set pointers within this allocation.
You can use one large vector. You'll need to manually calculate the offset of each sub-array yourself.
vectors guarantee that their data is stored in contiguous memory, but be careful of maintaining references or pointers to individual elements if the vector is used in such a way that may make it reallocate. Shouldn't be a problem since you're not adding anything beyond the initial size.
int main() {
    std::vector<T> vec(total_length); // total_length: the precomputed total element count
    // now you'll need to manually translate the offset of
    // a given "array" and then add the offset of the element to that
    T someElem = vec[array_offset + element_offset];
}
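One way to keep the offset bookkeeping in one place is a small wrapper around the big vector. This sketch assumes the per-array lengths are known by the time the offsets are built (for instance after the expensive computation has run once); FlatArrays and make_flat are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One contiguous block of elements plus the start offset of each sub-array.
struct FlatArrays {
    std::vector<int> data;
    std::vector<std::size_t> offsets;  // offsets[k] is where array k starts

    std::size_t length(std::size_t array) const {
        return offsets[array + 1] - offsets[array];
    }
    int& at(std::size_t array, std::size_t element) {
        assert(element < length(array));
        return data[offsets[array] + element];
    }
};

// Builds the offsets as a running sum of the lengths and sizes the data once.
FlatArrays make_flat(const std::vector<std::size_t>& lengths) {
    FlatArrays flat;
    flat.offsets.reserve(lengths.size() + 1);
    flat.offsets.push_back(0);
    for (std::size_t len : lengths) {
        flat.offsets.push_back(flat.offsets.back() + len);
    }
    flat.data.resize(flat.offsets.back());
    return flat;
}
```

Because the vector is sized once and never grown afterwards, references into it stay valid, as noted above.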
Yes, there is a better way:
std::vector<std::vector<Item>> array;
for (std::size_t i = 0; i < array.size(); ++i) {
    for (std::size_t j = 0; j < array[i].size(); j++) {
        array[i][j] = Item(some_other_calc());
    }
}
No pointers, no muss, no fuss.
Are you looking for memory efficiency, speed efficiency, or simplicity?
You can always write or download a dead-simple pool allocator, then pass that as the allocator to the appropriate data structures. Because you know the total size in advance, and never need to resize vectors or add new ones, this can be even simpler than a typical pool allocator. Just malloc all of the storage in one big block, and keep a single pointer to the next block. To allocate n bytes, T *ret = nextBlock; nextBlock += n; return ret;. If your objects are trivial and don't need destruction, you can even just do one big free at the end.
This means you can use any data structure you want, or compare and contrast different ones. A vector of vectors? A giant vector of cells plus a vector of offsets? Something else you came up with that sounds crazy but just might work? You can compare their readability, usability, and performance without worrying about the memory allocation side of things.
(Of course if your goal is speed, packing things this way may not be the best answer. You can often gain a lot of speed by wasting a little space to improve your cache and/or page alignment. You could write a fancy allocator that, e.g., allocates vector space in a transposed way to improve the performance of your algorithm that does column-major where it should do row-major and vice-versa, but at that point, it's probably easier to tweak your algorithms than your allocator.)
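The dead-simple pool allocator described in this answer might be sketched like this. BumpAllocator is a hypothetical name; the alignment handling is an addition that the one-line version above does not have, and hooking the pool into standard containers through an allocator adapter is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// One malloc up front, a moving "next" pointer, one free at the end.
class BumpAllocator {
public:
    explicit BumpAllocator(std::size_t capacity)
        : begin_(static_cast<unsigned char*>(std::malloc(capacity))),
          next_(begin_),
          end_(begin_ ? begin_ + capacity : nullptr) {}
    ~BumpAllocator() { std::free(begin_); }
    BumpAllocator(const BumpAllocator&) = delete;
    BumpAllocator& operator=(const BumpAllocator&) = delete;

    // Returns nullptr when the block is exhausted. The alignment must be a
    // power of two no larger than alignof(std::max_align_t), because offsets
    // are aligned relative to the malloc'ed base.
    void* allocate(std::size_t bytes,
                   std::size_t alignment = alignof(std::max_align_t)) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::size_t used = static_cast<std::size_t>(next_ - begin_);
        const std::size_t aligned = (used + alignment - 1) & ~(alignment - 1);
        if (begin_ == nullptr || aligned > capacity() || bytes > capacity() - aligned) {
            return nullptr;
        }
        next_ = begin_ + aligned + bytes;
        return begin_ + aligned;
    }

    std::size_t capacity() const { return static_cast<std::size_t>(end_ - begin_); }

private:
    unsigned char* begin_;
    unsigned char* next_;
    unsigned char* end_;
};
```

Individual allocations are never returned to the pool; everything is released at once when the allocator is destroyed, which matches the "one big free at the end" idea for trivially destructible objects.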