allocating shared memory - c++

I am trying to allocate shared memory using a constant parameter, but I am getting an error. My kernel looks like this:
__global__ void Kernel(const int count)
{
__shared__ int a[count];
}
and i am getting an error saying
error: expression must have a constant value
count is const! Why am I getting this error? And how can I get around this?

CUDA supports dynamic shared memory allocation. If you define the kernel like this:
__global__ void Kernel(const int count)
{
extern __shared__ int a[];
}
and then pass the number of bytes required as the third argument of the kernel launch
Kernel<<< gridDim, blockDim, a_size >>>(count);
then it can be sized at run time. Be aware that the runtime supports only a single dynamically sized allocation per block. If you need more, you will need to use pointers to offsets within that single allocation. Also be aware when using pointers that shared memory uses 32-bit words, and all allocations must be 32-bit word aligned, irrespective of the type of the shared memory allocation.
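The offset bookkeeping described above can be worked out on the host before the launch. The following is only a sketch in plain C++, assuming two arrays (one of int, one of double) packed into the single dynamic allocation; the names align_up, SharedLayout and make_layout are hypothetical helpers, not part of the CUDA API. It rounds each offset up to the alignment of the element type, which also satisfies the 32-bit word requirement.

```cpp
#include <cassert>
#include <cstddef>

// Round `offset` up to the next multiple of `alignment` (must be a power of two).
constexpr std::size_t align_up(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical layout of two arrays packed into one dynamic allocation.
struct SharedLayout {
    std::size_t a_offset;     // byte offset of the int array
    std::size_t b_offset;     // byte offset of the double array
    std::size_t total_bytes;  // size to request as the third launch argument
};

inline SharedLayout make_layout(std::size_t count_a, std::size_t count_b) {
    SharedLayout layout{};
    layout.a_offset = 0;
    const std::size_t end_of_a = count_a * sizeof(int);
    // The second array starts at the first suitably aligned byte after the first.
    layout.b_offset = align_up(end_of_a, alignof(double));
    layout.total_bytes = layout.b_offset + count_b * sizeof(double);
    assert(layout.b_offset % alignof(double) == 0);
    return layout;
}
```

Inside the kernel, the two offsets would be added to the base address of the single extern __shared__ array, and total_bytes would be passed as the dynamic shared memory size.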

const doesn't mean "constant", it means "read-only".
A constant expression is something whose value is known to the compiler at compile-time.
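A small plain C++ sketch of that distinction (the names kCount, fixed_bytes and runtime_bytes are made up for illustration): a constexpr value can be used as an array bound, while a const function parameter is only read-only and its value is not known until the call.

```cpp
#include <cassert>
#include <cstddef>

constexpr int kCount = 100;  // constant expression: value known at compile time

// OK: the array bound is a constant expression.
inline std::size_t fixed_bytes() {
    int a[kCount] = {};
    return sizeof(a);
}

// `count` is const (read-only inside the function), but its value is only
// known when the function is called, so in standard C++ it cannot be used
// as an array bound:
//     int a[count];   // error: expression must have a constant value
inline std::size_t runtime_bytes(const int count) {
    assert(count >= 0);
    return static_cast<std::size_t>(count) * sizeof(int);
}
```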

option one: declare shared memory with constant value (not the same as const)
__global__ void Kernel(int count_a, int count_b)
{
__shared__ int a[100];
__shared__ int b[4];
}
option two: declare shared memory dynamically in the kernel launch configuration:
__global__ void Kernel(int count_a, int count_b)
{
extern __shared__ int shared[];
int *a = &shared[0]; // a starts at the beginning of the shared allocation
int *b = &shared[count_a]; // b starts immediately after the end of a
}
size_t sharedMemory = count_a*sizeof(int) + count_b*sizeof(int);
Kernel<<<numBlocks, threadsPerBlock, sharedMemory>>>(count_a, count_b);
Note: all dynamically allocated shared memory arrays are given the same starting address. I use two shared memory arrays here to illustrate how to set up two arrays manually inside the single allocation.

From the "CUDA C Programming Guide":
The execution configuration is specified by inserting an expression of the form:
<<<Dg, Db, Ns, S>>>
where:
Dg is of type dim3 and specifies the dimension and size of the grid ...
Db is of type dim3 and specifies the dimension and size of each block ...
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory. This dynamically allocated memory is used by any of the variables declared as an external array as mentioned in __shared__; Ns is optional argument which defaults to 0;
S is of type cudaStream_t and specifies the associated stream ...
So by using the dynamic shared memory parameter Ns, the user can specify the total size of the dynamic shared memory a kernel function can use, no matter how many shared variables there are in the kernel.

You cannot declare a shared variable like this:
__shared__ int a[count];
However, if you are sure about the maximum size of array a, you can declare it directly, like
__shared__ int a[100];
but in this case you should think about how many blocks your program launches: shared memory that is reserved for a block but not fully used still counts against the per-multiprocessor limit, so fewer blocks can be resident at once, which can hurt performance.
A nice solution to this problem is to declare
extern __shared__ int a[];
and allocate the memory when launching the kernel, like
Kernel<<< gridDim, blockDim, a_size >>>(count);
but be careful here as well: if the kernel accesses more shared memory than you allocated at launch, you will get unexpected results.

Related

Memory limit in int main()

I need to make a big array for one task (more than 10^7 elements).
I found that if I declare it inside int main, the code doesn't work (the program exits before reaching cout, with "Process returned -1073741571 (0xC00000FD)").
If I declare it outside main, everything works.
(I am using Code::Blocks 17.12)
// doesn't work
#include <bits/stdc++.h>
using namespace std;
const int N = 1e7;
int main() {
int a[N];
cout << 1;
return 0;
}
// works
#include <bits/stdc++.h>
using namespace std;
const int N = 1e7;
int a[N];
int main() {
cout << 1;
return 0;
}
So I have two questions:
- Why does this happen?
- What can I do to define the array inside int main()? (Strangely, if I create a vector of the same size inside int main(), everything works.)
There are four main types of memory which are interesting for C++ programmers: stack, heap, static memory, and the memory of registers.
In
const int N = 1e7;
int main(){int a[N];}
stack memory is used.
This type of memory is usually more limited in size than the heap and static memory. That is why the program fails: on Windows, exit code 0xC00000FD is the stack overflow status.
Operator new (or another function that allocates memory on the heap) is needed in order to use the heap:
const int N = 1e7;
int main(){int* a = new int[N]; delete[] a;}
Usually, the operator new is not used explicitly.
std::vector uses heap (i.e. it uses new or something of the lower level underneath) (as opposed to the std::array or the C-style array, e.g. int[N]). Because of that, std::vector is usually capable of holding bigger chunks of data than the std::array or the C-style array.
If you do
const int N = 1e7;
int a[N];
int main(){}
static memory is utilized. It's usually less limited in size than the stack memory.
To wrap up, you used stack in int main(){int a[N];}, static memory in int a[N]; int main(){}, and heap in int main(){std::vector<int> v(N);}, and, because of that, received different results.
Use heap for big arrays (via the std::vector or the operator new, examples are given above).
The problem is that your array is very big. Assuming that int is 4 bytes, 10 000 000 integers take 40 000 000 bytes, which is about 40 MB. The default stack size is typically 1 MB on Windows and 8 MB on modern Linux. Local variables live on the stack, so you are trying to fit a 40 MB array into a 1 MB or 8 MB stack (on Windows or Linux respectively), and your program runs out of stack space. A global array is fine because global variables are placed in the program's bss/data segment, whose size is fixed when the program is loaded. A std::vector stores its elements in dynamic memory, i.e. on the heap, which is why your program does not crash in that case. If you don't want to use std::vector, you can dynamically allocate an array on the heap like this:
int* arrayPtr = new int[N];
Then you need to free the dynamically allocated memory with the delete[] operator once it is no longer needed:
delete[] arrayPtr;
But in this case you need to know how to work with pointers. Or, if you don't want the array to be dynamic and want it declared only inside main, you can make it static (I'm 99.9% sure this will work 😅, but you should try it), like this:
int main() {static int arr[N];return 0;}
It will then be located in the data segment (like a global variable).
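The two non-stack options discussed in these answers can be sketched as follows. This is an illustrative example only; kN, static_buffer and heap_buffer are hypothetical names, and the size matches the 10^7 elements from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kN = 10000000;  // 1e7 elements, as in the question

// Function-local static array: placed in static storage, not on the stack.
inline int* static_buffer() {
    static int buffer[kN];
    return buffer;
}

// std::vector: the elements live on the heap and are freed automatically
// when the vector goes out of scope.
inline std::vector<int> heap_buffer(std::size_t n) {
    assert(n > 0);
    return std::vector<int>(n, 0);
}
```

Both can be called from inside main() without touching the stack limit, because only a pointer or a small vector object is stored in the stack frame.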

Dynamic array in an array of structures in OpenCL

I have a struct :
struct A
{
double a;
int c;
double *array;
};
int main()
{
A *str = new A[50];
for(int i=0;i<50;i++)
{
str[i].array = new double[5];
str[i].array[0] = 50;
}
.....
Buffer BufA = Buffer(...,..., 50 * sizeof(A),str);
.....
}
In kernel
struct A
{
double a;
int c;
double *array;
};
__kernel void vector(__global struct A *str)
{
int id = get_global_id(0);
printf("Element - %f",str[id].array[0]);
}
But the kernel does not see the values in the array. Probably this is because the buffer only holds the array of structures, without the memory of the dynamic arrays. How can I implement this?
On modern systems, a process doesn't see the physical addresses of objects, but rather their virtual addresses.
This means two processes cannot pass each other pointers and expect them to mean the same thing. You need to rethink your application with that in mind.
On top of the address virtualization mentioned by YSC, you should also keep in mind that the memory that your graphics card (or other OCL device) is operating on may be distinct (as in, different pieces of hardware) from the memory your CPU is operating on.
The OpenCL buffers are responsible for transporting their contents between these memories. So for example an array of ints that you create and write to on the CPU would have to be copied to GPU memory (and have space allocated there, and possibly be copied back after the kernel is done), which these buffers do for you. But if you store pointers to other CPU memory in your buffer, then that other memory will not be transferred automatically. Further, the pointer relation would most likely break, as there is no guarantee that your other data is at the same location in GPU memory as in CPU memory.
The solution, naturally, is to put all the data you want transferred into buffers, including the sub-arrays. One way to do this without using an excessive number of buffers is to pack the sub-arrays together into one buffer and store indices into it instead of pointers to memory.
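A host-side sketch of that packing step, in plain C++: the names PackedA, PackedData, pack and element are hypothetical, and the struct is only assumed to be laid out the same way on the device. Each record stores an offset and a length into one shared pool instead of a pointer, so both the records and the pool can be copied into buffers as raw bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pointer-free record: safe to copy byte-for-byte into a device buffer.
struct PackedA {
    double a;
    int c;
    int offset;  // index of this record's first element in the shared pool
    int length;  // number of pool elements that belong to this record
};

struct PackedData {
    std::vector<PackedA> records;  // would go into one buffer
    std::vector<double> pool;      // all sub-arrays back to back, second buffer
};

inline PackedData pack(const std::vector<std::vector<double>>& arrays) {
    PackedData out;
    for (const auto& arr : arrays) {
        PackedA rec{};
        rec.offset = static_cast<int>(out.pool.size());
        rec.length = static_cast<int>(arr.size());
        out.records.push_back(rec);
        out.pool.insert(out.pool.end(), arr.begin(), arr.end());
    }
    return out;
}

// Same lookup a kernel would do: pool[records[id].offset + i].
inline double element(const PackedData& d, std::size_t record, int i) {
    const PackedA& rec = d.records.at(record);
    assert(i >= 0 && i < rec.length);
    return d.pool[static_cast<std::size_t>(rec.offset + i)];
}
```

With this layout the kernel would receive two buffers, one holding the records and one holding the pool, and would read the pool through the stored offset rather than through a host pointer.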

How does dynamic memory allocation allocate memory at run time?

int a[10];
The above code will create an array of ten ints, so the program will be able to store only ten integers.
Now consider the following commands
int *a,*b,*c,*d;
a= (int *)malloc(sizeof(int));
b= (int *)malloc(sizeof(int));
c= (int *)malloc(sizeof(int));
d= (int *)malloc(sizeof(int));
The above code will create four pointers to int and allocate int-sized memory for each of them.
I learnt that dynamic memory allocation allocates memory at run time.
What I want to know is this: whether I use an array or malloc (dynamic memory allocation), the user gets only a fixed number of int-sized slots to store data. Apart from the fact that these are pointer variables to int-sized memory, what is the use of dynamic memory allocation? In both cases the user gets only a fixed number of int slots, and getting more requires changing the source code. So why do we use malloc or dynamic memory allocation?
Consider
int a,*b;
cin >> a;
b= (int *)malloc(a*sizeof(int));
The user types a number a and gets a ints. The number a is known to neither the programmer nor the compiler here.
As pointed out in the comments, this is still bad style in C++; use std::vector if possible. Even new is better than malloc. But I hope the (bad) example helps to clarify the basic idea behind dynamic memory allocation.
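For comparison, here is a sketch of the same idea with std::vector, where the element count is an ordinary run-time value (it could come from user input). The helper names make_buffer and grow are invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build a buffer whose length is chosen while the program runs.
inline std::vector<int> make_buffer(std::size_t n) {
    std::vector<int> values(n);
    for (std::size_t i = 0; i < n; ++i) {
        values[i] = static_cast<int>(i);
    }
    return values;
}

// Enlarge an existing buffer; existing elements are kept.
inline void grow(std::vector<int>& values, std::size_t new_size) {
    assert(new_size >= values.size());  // this sketch only grows, never shrinks
    values.resize(new_size);            // new elements are value-initialized to 0
}
```

Neither the size passed to make_buffer nor the later size passed to grow has to be known when the program is compiled, which is exactly what a fixed-size array cannot offer.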
You're right that it's all just memory. But there is a difference in usage.
In the general case, you don't necessarily know ahead of time the amount of memory you will need or the time when such memory can be safely released. malloc and its friends are written so that they can keep track of memory used this way.
But in many special cases, you happen to know ahead of time how much memory you will need and when you will stop needing it. For example, you know you need a single integer to act as a loop counter when running a simple loop and you'll be done with it once the loop has finished executing. While malloc and its friends can still work for you here, local variables are simpler, less error prone and will likely be more efficient.
int a[10];
The above line of code will allocate an array of 10 ints of automatic storage duration, if it is within a local scope.
int *a,*b,*c,*d;
The above, however, will allocate 4 pointers to int, also of automatic storage duration if within a local scope.
a= (int *)malloc(sizeof(int));
b= (int *)malloc(sizeof(int));
c= (int *)malloc(sizeof(int));
d= (int *)malloc(sizeof(int));
And finally, the above will dynamically allocate one int variable per pointer. So each of the pointers above will point to a single int variable.
Do note that dynamically allocated memory can be freed and resized at run time, unlike statically allocated memory. Memory of automatic storage duration is freed when it goes out of scope, but cannot be resized.
If you program in C, casting the result of malloc is unnecessary.
I suggest you read this: Do I cast the result of malloc?
Also, what you're doing in your code with the 4 pointers is unnecessary; in fact you can allocate an array of 4 ints with a single malloc:
int *a;
a = malloc(4 * sizeof(int));

How do I use global memory correctly in CUDA?

I'm trying to write a CUDA application which uses global memory defined with __device__.
These variables are declared in a .cuh file.
My main is in another .cu file, in which I do the cudaMalloc and cudaMemcpy calls.
This is part of my code:
cudaMalloc((void**)&varOne,*tam_varOne * sizeof(cuComplex));
cudaMemcpy(varOne,C_varOne,*tam_varOne * sizeof(cuComplex),cudaMemcpyHostToDevice);
varOne is declared in the .cuh file like this:
__device__ cuComplex *varOne;
When I launch my kernel (I'm not passing varOne as a parameter) and try to read varOne with the debugger, it says that it can't read the variable. The pointer address is 000..0, so it is obviously wrong.
So, how do I have to declare and copy the global memory in CUDA?
First, you need to declare the pointers to the data that will be copied from the CPU to the GPU. In the example below, we want to copy the array original_cpu_array to CUDA global memory.
int original_cpu_array[array_size];
int *array_cuda;
Calculate the memory size that the data will occupy.
int size = array_size * sizeof(int);
Cuda memory allocation:
msg_erro[0] = cudaMalloc((void **)&array_cuda,size);
Copying from CPU to GPU:
msg_erro[0] = cudaMemcpy(array_cuda, original_cpu_array,size,cudaMemcpyHostToDevice);
Execute kernel
Copying from GPU to CPU:
msg_erro[0] = cudaMemcpy(original_cpu_array,array_cuda,size,cudaMemcpyDeviceToHost);
Free Memory:
cudaFree(array_cuda);
For debugging reasons, I typically save the status of the function calls in an array (e.g., cudaError_t msg_erro[var];). This is not strictly necessary, but it will save you some time if an error occurs during the allocation or the memory transfers.
And if errors do occur, I print them using a function like:
void printErros(cudaError_t *erros,int size, int flag)
{
for(int i = 0; i < size; i++)
if(erros[i] != 0)
{
if(flag == 0) printf("Memory allocation ");
if(flag == 1) printf("CPU -> GPU ");
if(flag == 2) printf("GPU -> CPU ");
printf("{%d} => %s\n",i ,cudaGetErrorString(erros[i]));
}
}
The flag mainly indicates where in the code the error occurred. For instance, after a memory allocation:
msg_erro[0] = cudaMalloc((void **)&array_cuda,size);
printErros(msg_erro,msg_erro_size, 0);
I have experimented with some examples and found that the simplest approach is to pass the pointer to the kernel as an argument instead of relying on the __device__ global variable. You also need to do the allocation in main(), even though the variable is declared in the .cuh file.
Reason:
Declaring the variable globally does not allocate any buffer in GPU global memory. The declaration __device__ cuComplex *varOne; only creates the pointer variable itself, and that variable lives on the device, so host code cannot fill it in by passing &varOne to cudaMalloc. The buffer is allocated only when cudaMalloc((void**)&varOne,sizeof(cuComplex)) is called with the address of a pointer that the host can write to.
So declare the pointer as an ordinary host-side variable. After cudaMalloc() it holds a device address, which you can pass to the kernel. (If you want to keep the __device__ variable, the address obtained this way has to be copied into it with cudaMemcpyToSymbol.)
The sequence of steps is: (for my tested code)
int *Ad; // ordinary host-side pointer; holds a device address after cudaMalloc
__global__ void Kernel(int *Ad){
....
}
int main(){
....
int size=100*sizeof(int);
cudaMalloc((void**)&Ad,size);
cudaMemcpy(Ad,A,size,cudaMemcpyHostToDevice);
....
}

Stack overflow when declaring 2 arrays

When I run my program with 1 array, like this:
int a[430][430];
int i, j, i_r0, j_r0;
double c, param1, param2;
int w_far = 0,h_far = 0;
char* magic_num1 = "";
it works fine!
But when I write:
int a[430][430];
int i, j, i_r0, j_r0;
int nicky[430][430]; // Added line
double c, param1, param2;
int w_far = 0,h_far = 0;
char* magic_num1 = "";
the program does not run and fails with the error "stack overflow"!
I don't know how to solve it!
You need to either increase the stack space (how that is done depends on your platform), or you need to allocate the array from the heap, or even better, use std::vector instead of an array.
You're trying to allocate ~1.48 MB of stuff on the stack [1]; on your system (and not only on it) that's too much.
In general, the stack is not made for keeping big objects; you should put them on the heap instead. Use dynamic allocation with new or std::vector, or, even better suited to your case, boost::multi_array.
[1] Assuming 32-bit ints.
A proper solution is to use the heap, but also note that you'll likely find that changing to:
short a[430][430];
short nicky[430][430]; // Added line
fixes the overflow, depending on your platform. So if 'short', or 'unsigned short' is big enough, this might be an option.
In fact, even when using the heap, consider carefully the array type to reduce memory footprint for a large array.
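The footprint arithmetic can be checked with a short sketch. It uses the fixed-width types std::int32_t and std::int16_t as stand-ins for int and short, and the 1 MB stack budget in the usage note is only an assumed example value; footprint and fits are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed for `arrays` square tables of `side` x `side` elements of type T.
template <typename T>
constexpr std::size_t footprint(std::size_t side, std::size_t arrays) {
    return side * side * sizeof(T) * arrays;
}

static_assert(footprint<std::int32_t>(430, 2) == 1479200, "two 32-bit arrays: about 1.48 MB");
static_assert(footprint<std::int16_t>(430, 2) == 739600, "two 16-bit arrays: about 0.74 MB");

// Rough check against an assumed stack budget; real limits are platform-specific
// and other locals and call frames also consume stack space.
inline bool fits(std::size_t bytes, std::size_t stack_budget) {
    assert(stack_budget > 0);
    return bytes <= stack_budget;
}
```

With an assumed budget of 1024 * 1024 bytes, the two 32-bit arrays do not fit while the two 16-bit arrays do, which matches the behaviour described above.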
Local variables are allocated on the "stack", a storage area that is used for several purposes and is limited to a certain size.
Usually you can declare variables of up to several kilobytes, but when you need more memory it is usually suggested to use the "heap", which can be allocated with the new operator or std::vector.
std::vector is an alternative to traditional arrays, and its data is safely stored on the heap.
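As an illustration of the std::vector route for a two-dimensional table such as a[430][430], here is a minimal sketch that stores the elements in one heap-backed vector and maps (row, column) to a flat index. The class name Grid and its interface are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal 2D table backed by a single heap-allocated std::vector.
class Grid {
public:
    Grid(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0) {}

    // Row-major access with a bounds check in debug builds.
    int& at(std::size_t r, std::size_t c) {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

    std::size_t size() const { return data_.size(); }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<int> data_;  // elements live on the heap, not on the stack
};
```

Declaring Grid a(430, 430); and Grid nicky(430, 430); as locals puts only the small Grid objects on the stack, while the element storage is allocated on the heap.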
To avoid stack overflow, allocate the arrays in the heap.
If one uses C, then allocating an array of size n in the heap can be done by e.g.
int* A = (int*) malloc(n*sizeof(int));
But you must remember to free that memory when it is no longer needed with
free(A);
to avoid a memory leak.
Equivalently in C++:
int* A = new int[n];
and free with
delete [] A;
This site was helpful.