2D arrays with contiguous rows on the heap memory for cudaMemCpy2D() - c++

CUDA documentation recommends the use of cudaMemCpy2D() for 2D arrays (and similarly cudaMemCpy3D() for 3D arrays) instead of cudaMemCpy() for better performance as the former allocates device memory more appropriately. On the other hand, all cudaMemCpy functions, just like memcpy(), require contiguous allocation of memory.
This is all fine if I create my (host) array as, for example, float myArray[h][w];. However, it most likely will not work if I use something like:
float** myArray2 = new float*[h];
for( int i = 0 ; i < h ; i++ ){
myArray2[i] = new float[w];
}
This is not a big problem except when one is trying to implement CUDA into an existing project, which is the problem I am facing. Right now, I create a temporary 1D array, copy the contents of my 2D array into it and use cudaMemCpy() and repeat the whole process to get the results after the kernel launch, but this does not seem an elegant/efficient way.
Is there a better way to handle this situation? Specifically, is there a way to create a genuine 2D array on the heap with contiguously allocated rows so that I can use cudaMemCpy2D()?
P.S: I couldn't find the answer to this question the following previous similar posts:
Allocate 2D array with cudaMallocPitch and copying with
cudaMemcpy2D
Assigning memory for contiguous 2D array
Dynamic 2d Array non contiguous memory c++ (The second answer in
this one is rather puzzling.)

Allocate the big array, then use pointer arithmetic to find the actual beginnings of the rows.
float* bigArray = new float[h * w]
float** myArray2 = new float*[h]
for( int i = 0 ; i < h ; i++ ){
myArray2[i] = &bigArray[i * w];
}
Your myArray2 array of pointers gives you C/C++ style two dimensional arrays behavior, bigArray gives you the contiguous block of memory needed by CUDA.

Related

How to pass dynamic and static 2d arrays as void pointer?

for a project using Tensorflow's C API I have to pass a void pointer (void*) to a method of Tensorflow. In the examples the void* points to a 2d array, which also worked for me. However now I have array dimensions which do not allow me to use the stack, which is why I have to use a dynamic array or a vector.
I managed to create a dynamic array with the same entries like this:
float** normalizedInputs;//
normalizedInputs = new float* [noCellsPatches];
for(int i = 0; i < noCellsPatches; ++i)
{
normalizedInputs[i] = new float[no_input_sizes];
}
for(int i=0;i<noCellsPatches;i++)
{
for(int j=0;j<no_input_sizes;j++)
{
normalizedInputs[i][j]=inVals.at(no_input_sizes*i+j);
////
////
//normalizedInputs[i][j]=(inVals.at(no_input_sizes*i+j)-inputMeanValues.at(j))/inputVarValues.at(j);
}
}
The function call needing the void* looks like this:
TF_Tensor* input_value = TF_NewTensor(TF_FLOAT,in_dims_arr,2,normalizedInputs,num_bytes_in,&Deallocator, 0);
In argument 4 you see the "normalizedInputs" array. When I run my program now, the calculated results are totally wrong. When I go back to the static array they are right again. What do I have to change?
Greets and thanks in advance!
Edit: I also noted that the TF_Tensor* input_value holds totally different values for both cases (for dynamic it has many 0 and nan entries). Is there a way to solve this by using a std::vector<std::vector<float>>?
Respectively: is there any valid way pass a consecutive dynamic 2d data structure to a function as void*?
In argument 4 you see the "normalizedInputs" array. When I run my program now, the calculated results are totally wrong.
The reason this doesn't work is because you are passing the pointers array as data. In this case you would have to use normalizedInputs[0] or the equivalent more explicit expression &normalizedInputs[0][0]. However there is another bigger problem with this code.
Since you are using new inside a loop you won't have contiguous data which TF_NewTensor expects. There are several solutions to this.
If you really need a 2d-array you can get away with two allocations. One for the pointers and one for the data. Then set the pointers into the data array appropriately.
float **normalizedInputs = new float* [noCellsPatches]; // allocate pointers
normalizedInputs[0] = new float [noCellsPatches*no_input_sizes]; // allocate data
// set pointers
for (int i = 1; i < noCellsPatches; ++i) {
normalizedInputs[i] = &normalizedInputs[i-1][no_input_sizes];
}
Then you can use normalizedInputs[i][j] as normal in C++ and the normalizedInputs[0] or &normalizedInputs[0][0] expression for your TF_NewTensor call.
Here is a mechanically simpler solution, just use a flat 1d array.
float * normalizedInputs = new float [noCellsPatches*no_input_sizes];
You access the i,j-th element by normalizedInputs[i*no_input_sizes+j] and you can use it directly in the TF_NewTensor call without worrying about any addresses.
C++ standard does its best to prevent programmers to use raw arrays, specifically multi-dimensional ones.
From your comment, your statically declared array is declared as:
float normalizedInputs[noCellsPatches][no_input_sizes];
If noCellsPatches and no_input_sizes are both compile time constants you have a correct program declaring a true 2D array. If they are not constants, you are declaring a 2D Variable Length Array... which does not exist in C++ standard. Fortunately, gcc allow it as an extension, but not MSVC nor clang.
If you want to declare a dynamic 2D array with non constant rows and columns, and use gcc, you can do that:
int (*arr0)[cols] = (int (*) [cols]) new int [rows*cols];
(the naive int (*arr0)[cols] = new int [rows][cols]; was rejected by my gcc 5.4.0)
It is definitely not correct C++ but is accepted by gcc and does what is expected.
The trick is that we all know that the size of an array of size n in n times the size of one element. A 2D array of rows rows of columnscolumns if then rows times the size of one row, which is columns when measured in underlying elements (here int). So we ask gcc to allocate a 1D array of the size of the 2D array and take enough liberalities with the strict aliasing rule to process it as the 2D array we wanted. As previously said, it violates the strict aliasing rule and use VLA in C++, but gcc accepts it.

How Physically are Arrays Stored (Specifically with dimensions greater than 2)?

What I Know
I know that arrays int ary[] can be expressed in the equivalent "pointer-to" format: int* ary. However, what I would like to know is that if these two are the same, how physically are arrays stored?
I used to think that the elements are stored next to each other in the ram like so for the array ary:
int size = 5;
int* ary = new int[size];
for (int i = 0; i < size; i++) { ary[i] = i; }
This (I believe) is stored in RAM like: ...[0][1][2][3][4]...
This means we can subsequently replace ary[i] with *(ary + i) by just increment the pointers' location by the index.
The Issue
The issue comes in when I am to define a 2D array in the same way:
int width = 2, height = 2;
Vector** array2D = new Vector*[height]
for (int i = 0; i < width; i++) {
array2D[i] = new Vector[height];
for (int j = 0; j < height; j++) { array2D[i][j] = (i, j); }
}
Given the class Vector is for me to store both x, and y in a single fundamental unit: (x, y).
So how exactly would the above be stored?
It cannot logically be stored like ...[(0, 0)][(1, 0)][(0, 1)][(1, 1)]... as this would mean that the (1, 0)th element is the same as the (0, 1)th.
It cannot also be stored in a 2d array like below, as the physical RAM is a single 1d array of 8 bit numbers:
...[(0, 0)][(1, 0)]...
...[(0, 1)][(1, 1)]...
Neither can it be stored like ...[&(0, 0)][&(1, 0)][&(0, 1)][&(1, 1)]..., given &(x, y) is a pointer to the location of (x, y). This would just mean each memory location would just point to another one, and the value could not be stored anywhere.
Thank you in advanced.
What OP is struggling with a dynamically allocated array of pointers to dynamically allocated arrays. Each of these allocations is its own block of memory sitting somewhere in storage. There is no connection between them other than the logical connection established by the pointers in the outer array.
To try to visualize this say we make
int ** twodee;
twodee = new int*[4];
for (int i = 0; i < 4; i++)
{
twodee[i] = new int[4];
}
and then
int count = 1;
for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
{
twodee[i][j] = count++;
}
}
so we should wind up with twodee looking something like
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
right?
Logically, yes. But laid out in memory twodee might look something like this batsmurph crazy mess:
You can't really predict where your memory will be, you're at the mercy of the whatever memory manager handles the allocations and what already in storage where it might have been efficient for your memory to go. This makes laying dynamically-allocated multi-dimensional arrays out in your head almost a waste of time.
And there are a whole lot of things wrong with this when you get down into the guts of what a modern CPU can do for you. The CPU has to hop around a lot, and when it's hopping, it's ability to predict and preload the cache with memory you're likely to need in the near future is compromised. This means your gigahertz computer has to sit around and wait on your megahertz RAM a lot more than it should have to.
Try to avoid this whenever possible by allocating single, contiguous blocks of memory. You may pick up a bit of extra code mapping one dimensional memory over to other dimensions, but you don't lose any CPU time. C++ will have generated all of that mapping math for you as soon as you compiled [i][j] anyway.
The short answer to your question is: It is compiler dependent.
A more helpful answer (I hope) is that you can create 2D arrays that are layed out directly in memory, or you can create "2D arrays" that are actually 1D arrays, some with data, some with pointers to arrays.
There is a convention that the compiler is happy to generate the right kind of code to dereference and/or calculate the address of an element within an array when you use brackets to access an element in the array.
Generally arrays that are known to be 2D at compile time (eg int array2D[a][b]) will be layed out in memory without extra pointers and the compiler knows to multiply AND add to get an address each time there is an access. If your compiler isn't good at optimizing out the multiply, it makes repeated accesses much slower than they can be, so in the old days we often did pointer math ourselves to avoid the multiply if possible.
There is the issue that a compiler might optimize by rounding the lower dimension size up to a power of two, so a shift can be used instead of multiply, which would then require padding the locations (then even though they are all in one memory block, there are meaningless holes).
(Also, I'm pretty sure I've run into the problem that within a procedure, it needs to know which way the 2D array really is, so you may need to declare parameters in a way that lets the compiler know how to code the procedure, eg a[][] is different from *a[]). And obviously you can actually get the pointer from the array of pointers, if that is what you want--which isn't the same thing as the array it points too, of course.
In your code, you have clearly declared a full set of the lower dimension 1D arrays (inside the loop), and you have ALSO declared another 1D array of pointers you use to get to each one without a mulitply--instead by a dereference. So all those things will be in memory. Each 1D array will surely be sequentially layed out in a contiguous block of memory. It is just that it is entirely up to the memory manager as to where those 1D arrays are, relative to each other. (I doubt a compiler is smart enough to actually do the "new" ops at compile time, but it is theoretically possible, and would obviously affect/control the behavior if it did.)
Using the extra array of pointers clearly avoids the multiply ever and always. But it takes more space, and for sequential access actually makes the accesses slower and bigger (the extra dereference) versus maintaining a single pointer and one dereference.
Even if the 1D arrays DO end up contiguous sometimes, you might break it with another thread using the same memory manager, running a "new" while your "new" inside the loop is repeating.

Error: Deallocating a 2D array

I am developing a program in which one of the task is to read points (x,y and z) from a text file and then store them in an array. Now the text file may contain 10^2 or even 10^6 points, depending upon the text file user selects. Therefore I am defining a dynamic array.
For allocating a dynamic 2D array, I wrote as below and it works fine:
const int array_size = 100000;
float** array = new float* [array_size];
for(int i = 0; i < array_size; ++i){
ary[i] = new float[2]; // 0,1,2 being the columns for x,y,z co-ordinates
}
After the points are saved in the array, I write the following to deallocate the unallocated memory :
for (int i = 0; i < array_size; i++){
delete [] array[i];
}
delete [] array;
and then my program stops working and shows "Project.exe stopped working".
If I don't deallocate, the program works just fine.
In your comment you say 0,1,2 being the columns for x,y,z co-ordinates, if that's the case, you need to be allocating as float[3]. When you allocate an array of float[N], you are allocating a chunk of the memory of the size N * sizeof(float), and you will index them in the array from 1 to N - 1. Therefore if you need indeces 0,1,2, you will need to allocate a memory of the size 3 * sizeof(float), which makes it float[3].
Because other than that, I can compile and run the code without an error. If you fix it and still get an error, it might be your compiler problem. Then try to decrease 100000 to a small number and try again.
You are saying that you are trying to implement a dynamic array, this is what std::vector does and I would highly recommend that you use it. This way you are using something from the standard library that's extremely well tested and you won't run into issues by essentially trying to roll your own version of std::vector. Additionally this approach wraps memory better as it uses RAII which leverages the language to solve a lot of memory management issues. This has other benefits too like making your code more exception safe.
Also if you are storing x,y,z coordinates consider using a struct or a tuple, I think that enhances readability a lot. You can typedef the coordinate type too. Something like std::vector< coord_t > is more readable to me.
(Thanx a lot for suggestions!!)
Finally I am using vectors for the stated problem for reasons as below:
1.Unlike Arrays (not array object ofcourse), I don't need to manually deallocate unallocated memory.
2.There are numerous built in methods defined under vector class
Vector size can be extended at later stages
Below is how I used 2D Vector to store points (x,y,z co-ordinates)
Initialized (allocated memory) a 2D vector:
vector<vector<float>> array (1000, vector<float> array (3));
Where 1000 is the number of rows, and 3 is the number of columns
Once declared, values can be passed simply as:
array[i][j] = some value;
Also, at later stage I declared functions taking vector arguments and returning vectors as:
vector <vector <float>> function_name ( vector <vector <float>>);
vector <vector <float>> function_name ( vector <vector <float>> input_vector_name)
{
return output_vector_name_created_inside_function
}
Note: This method crates a copy of vector while returning, use pointer to return by reference. Even though mine is not working when I return vector by reference :(
For multi arrays I recommended use boost::multi_array.
Example:
typedef boost::multi_array<double, 3> array_type;
array_type A(boost::extents[3][4][2]);
A[0][0][0] = 3.14;

Flattening a 3D array in c++ for use with MPI

Can anyone help with the general format for flattening a 3D array using MPI? I think I can get the array 1 dimensional just by using (i+xlength*j+xlength*ylength*k), but then I have trouble using equations that reference particular cells of the array.
I tried chunking the code into chunks based on how many processors I had, but then when I needed a value that another processor had, I had a hard time. Is there a way to make this easier (and more efficient) using ghost cells or pointer juggling?
You have two options at least. The simpler one is to declare a preprocessor macro that hides the complexity of the index calculation, e.g.:
#define ARR(A,i,j,k) A[(i)*ylength*zlength+(j)*zlength+(k)]
ARR(myarray,i,j,k) = ARR(myarray,i+1,j,k) + ARR(myarray,i,j+1,k) + ...
This is clumsy since the macro will only work with arrays of fixed leading dimensions, e.g. whatever x ylength x zlength.
Much better way to do it is to use so-called dope vectors. Dope vectors are basically indices into the big array. You allocate one big flat chunk of size xlength * ylength * zlength to hold the actual data and then create an index vector (actually a tree in the 3D case). In your case the index has two levels:
top level, consisting of xlength pointers to the
second level, consisting of xlength arrays of pointers, each containing ylength pointers to the beginning of a block of zlength elements in memory.
Let's call the top level pointer array A. Then A[i] is a pointer to a pointer array that describes the i-th slab of data. A[i][j] is the j-th element of the i-th pointer array, which points to data[i][j][0] (if data was a 3D array). Construction of the dope vector works similar to this:
double *data = new double[xlength*ylength*zlength];
double ***A;
A = new double**[xlength];
for (int i = 0; i < xlength; i++)
{
A[i] = new double*[ylength];
for (int j = 0; j < ylength; j++)
A[i][j] = data + i*ylength*zlength + j*zlength;
}
Dope vectors are as easy to use as normal arrays with some special considerations. For example, A[i][j][k] will give you access to the desired element of data. One caveat of dope vectors is that the top level consist of pointers to other pointer tables and not of pointers to the data itself, hence A cannot be used as shortcut for &A[0][0][0], nor A[i] used as shortcut for &A[i][0][0]. Still A[i][j] is equivalent to &A[i][j][0]. Another caveat is that this form of array indexing is slower than normal 3D array indexing since it involves pointer chasing.
Some people tend to allocate a single storage block for both data and dope vectors. They simply place the index at the beginning of the allocated block and the actual data goes after that. The advantage of this method is that disposing the array is as simple as deleting the whole memory block, while disposing dope vectors, created with the code from the previous section, requires multiple invocations of the free operator.

Dynamic Memory allocation in 2-D Array

I have to create dynamic 2d array.I was trying with the code mentioned below:
Suppose a=512 and b =102.So Now the 2-D Array created is ary[512][102].
Now I am Creating a pointer to base location i.e int *ptr=&(ary[0][0]);
Now, if I am giving pointer the offset 102 i.e ptr+=102, it should point to &(ary[1][0]) but it doesn't point to.
If given offset of 104 then only it points to &(ary[1][0]). Why it requires extra 2 Offset????
Code Snippet:
int** ary;
ary= new int*[a];
for(int i = 0; i < a; ++i)
ary[i] = new int[b];
your 2d array is dynamically allocated as your code suggested, so the address of this 2d arrays is not in sequence, thus &(ary[0][0])+102, you can't grantee the pointer returned is the next row. So in this case, you should just use &(ary[1][0])
Unlike in the static allocation of 2D arrays, the addresses are not consecutive for the whole array when you use multiple new. Instead when you call for new() to create rows of a 2D array each new returns completely unrelated addresses which are allocated from heap. So you have to use ary[n] to access elements of (n + 1)th row and ary[n+1] to access elements of (n+2)th row and so on.
In this case, your 2-D array is not necessarily linear in the actual memory. So you can not write a program assumming that the memory is linear.
The reason for 104 pointing to ary[1][0] could be the boundary alignment. The run time system chooses to use 104 bytes for your ary[i], rather than 102.