I have a data structure with pointers (think linked lists). Its size can't be determined before launching the kernel that reads the input data. So I allocate data on the device during input processing.
However, trying to copy that data back to host fails. From what I could gather, this is because there is a limitation in CUDA that does not allow device-allocated memory to be accessed by the runtime API. That information, however, was for CUDA 4 with "a fix coming soon". Does anyone know if that fix or a workaround ever came? I can't seem to find any recent information on this.
Here's a reproducible example:
#include <cstdio>

__device__ int *devData;

__global__ void initKernel()
{
    devData = new int[6];
    devData[0] = 0;
    devData[1] = 1;
    devData[2] = 2;
    devData[3] = 3;
    devData[4] = 4;
    devData[5] = 5;
}

__global__ void printKernel()
{
    printf("Testing device: %d\n", devData[3]);
}

int main()
{
    initKernel<<<1,1>>>();
    cudaDeviceSynchronize();
    printKernel<<<1,1>>>();
    cudaDeviceSynchronize();

    int *devAddr;
    cudaGetSymbolAddress((void **)&devAddr, devData);

    int *hostData = new int[6];
    cudaMemcpy(hostData, devAddr, 6*sizeof(int), cudaMemcpyDeviceToHost); //cudaErrorInvalidValue (invalid argument)
    //same error with: cudaMemcpyFromSymbol(hostData, devData, 6*sizeof(int));

    printf("Testing host: %d\n", hostData[3]);
    return 0;
}
This throws a cudaErrorInvalidValue for cudaMemcpy (same for cudaMemcpyFromSymbol). This does not throw an error when I use __device__ int devData[6]; instead of __device__ int *devData; and prints 3 as expected.
It's still not possible.
This is documented in the programming guide.
In addition, device malloc() memory cannot be used in any runtime or driver API calls (i.e. cudaMemcpy, cudaMemset, etc).
If you have data in allocations that were created by in-kernel malloc() that you wish to transfer to the host, you will need to transfer that data first to a device memory allocation (or managed allocation), before copying to host or using in host code.
The same comments and all aspects of usage for in-kernel malloc apply equally to in-kernel new as well as in-kernel cudaMalloc.
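To make that concrete, here is a minimal sketch of the workaround described above (the copy kernel and buffer names are my own, not part of the original code): copy the in-kernel allocation into a cudaMalloc'ed buffer inside a kernel, then cudaMemcpy that buffer to the host.

__global__ void copyKernel(int *dst, int n)
{
    // devData was filled by initKernel via in-kernel new
    for (int i = 0; i < n; i++)
        dst[i] = devData[i];
}

// Host side:
// int *devBuf;
// cudaMalloc(&devBuf, 6 * sizeof(int));      // runtime-API allocation
// copyKernel<<<1,1>>>(devBuf, 6);            // device-side copy out of the in-kernel allocation
// cudaDeviceSynchronize();
// cudaMemcpy(hostData, devBuf, 6 * sizeof(int), cudaMemcpyDeviceToHost); // now valid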
Related
If I do something like:
void f() {
    const int n = 1<<14;
    int *foo = new int [n];
}

or

void f() {
    const int n = 1<<14;
    int *foo = new int [n]();
}
Will the Linux kernel use lazy memory allocation? For the second case, will it behave in the same way as when creating static arrays?
How far can I take this? For instance, having a struct that will be filled with 0s, will it always be allocated lazily, or will it actually allocate physical RAM when it is initialized?
struct X {
    int a, b, c, d, f, g, ..., z;
};

void f() {
    X *foo = new X();     //lazy?
    const int n = 1<<14;
    X *bar = new X[n](); //lazy?
}
For a standard Ubuntu 20.04 machine running Linux 5.4.0-51-generic....
We can observe this directly. In the code below, I increased the n value to 1 << 24 (~16 million ints = 64MB for 32-bit int) so it's the dominant factor in overall memory usage. I compiled, ran, and observed memory usage in htop:
#include <unistd.h>

int main() {
    int *foo = new int [1 << 24];
    sleep(100);
}
htop values: VIRT 71416KB / RES 1468KB
The virtual address allocations include the memory allocated by new, but the resident memory size is much smaller - indicating that distinct physical backing memory pages weren't needed yet for all the 64MB allocated.
After changing to int *foo = new int[1<<24]();:
htop values: VIRT 71416KB / RES 57800KB
Requesting the memory be zeroed resulted in a resident memory value just under the 64MB that was initialised, and it won't have been due to memory pressure (I have 64GB RAM), but some algorithm in the kernel must have decided to page out some of the backing memory after it was zeroed (I suspect kswapd?). The large RES value suggests that each page zeroed was given a distinct page of physical backing memory (as distinct from e.g. being mapped to the OS's zero-page for COW-allocation of an actual backing page).
With structs:
#include <unistd.h>

struct X {
    int a[1 << 24];
};

int main() {
    auto foo = new X;
    sleep(100);
}
htop values: VIRT 71416KB / RES 1460KB
This shows insufficient RES for the static arrays to have distinct backing pages. Either the virtual memory has been pre-mapped to the OS zero-page, or it's unmapped and will be mapped initially to the zero-page when accessed, then given its own physical backing page if written to - I'm not sure which, but in terms of actual physical RAM usage it doesn't make any difference.
After changing to auto foo = new X{};
htop values: VIRT 71416KB / RES 67844KB
You can clearly see that initialising the bytes to 0s resulted in use of backing memory for the arrays.
Addressing your questions:
Will the Linux kernel use lazy memory allocation?
The virtual memory allocation is done when the new is done. Distinct physical backing memory is allocated lazily when an actual write is done to the memory by the user-space code.
For the second case, will it behave in the same way as when creating static arrays?
#include <unistd.h>

int g_a[1 << 24];

int f(int i) {
    static int a[1 << 24];
    return a[i];
}

int main(int argc, const char* argv[]) {
    sleep(20);
    int k = f(2930);
    sleep(20);
    return argc + k;
}
htop values: VIRT 133MB / RES 1596KB
When this was run, the memory didn't jump after 20 seconds, indicating all the virtual address space was allocated during program loading. The low resident memory shows that the pages were not accessed and zeroed the way they were for new.
Just to address a potential point of confusion: while the Linux Kernel will zero out backing memory the first time it's provided to the process, any given call to new won't (in any implementation I've seen) know whether the memory allocated is being recycled from earlier dynamic allocations - which might have had non-zero values written into it - that have since been deleted/freed. Because of this, if you use memory-zeroing forms like new X{} or new int[n]() then the memory will be unconditionally cleared by the user-space code, causing the full amount of backing memory to be assigned and faulted in.
As many comments said, operator new usually uses malloc under the hood. malloc allocates space but does not by default allocate physical pages. However, malloc often writes internal data to the beginning of a block of memory, so only the first or first couple of pages allocated as virtual address space will fault and be allocated physically by the Linux kernel. The Linux kernel zeroes all allocated physical pages so whether you add () to the end of the allocation to zero-initialize the allocated memory probably has no effect, in terms of new physical pages being assigned. (Already allocated physical pages mapped to the allocated virtual address range are zeroed in that case.)
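To reproduce these observations programmatically rather than in htop, here is a minimal sketch (Linux-specific, and the helper is my own): it reads the VmRSS line from /proc/self/status before the allocation, after the allocation, and after every element has been written. Resident memory should stay small until the writes happen.

#include <cstdio>
#include <cstring>

// Print the VmRSS line from /proc/self/status (Linux-specific).
static void printRss(const char *label) {
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    while (f && fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s %s", label, line);
    if (f) fclose(f);
}

int main() {
    const long n = 1L << 24;
    printRss("before new: ");
    int *foo = new int[n];        // virtual address space only
    printRss("after new:  ");
    for (long i = 0; i < n; i++)  // touching each page faults in physical backing memory
        foo[i] = 1;
    printRss("after write:");
    delete[] foo;
}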
I have a struct:
struct A
{
    double a;
    int c;
    double *array;
};

int main()
{
    A *str = new A[50];
    for(int i = 0; i < 50; i++)
    {
        str[i].array = new double[5];
        str[i].array[0] = 50;
    }
    .....
    Buffer BufA = Buffer(...,..., 50 * sizeof(A), str);
    .....
}
In the kernel:
struct A
{
    double a;
    int c;
    double *array;
};

__kernel void vector(__global A *str)
{
    int id = get_global_id(0);
    printf("Element - %f", str[id].array[0]);
}
But the kernel does not see the values in the array. Probably this is because the buffer only allocates memory for the array of structures, not for the dynamically allocated arrays. How can I implement this?
On a modern system, a process doesn't see the actual addresses of objects, but rather their virtual addresses.
This means two processes cannot pass each other pointers and expect them to mean the same thing. You need to rethink your application with that in mind.
On top of the address virtualization mentioned by YSC, you should also keep in mind that the memory that your graphics card (or other OCL device) is operating on may be distinct (as in, different pieces of hardware) from the memory your CPU is operating on.
The OpenCL buffers are responsible for transporting their contents between these memories. So for example an array of ints that you create and write to on the CPU would have to be copied to GPU memory (and have space allocated there, and possibly be copied back after the kernel is done), which these buffers do for you. But if you store pointers to other CPU memory in your buffer, then that other memory will not be transferred automatically. Further, the pointer relation would most likely break, as there is no guarantee that your other data is at the same location in GPU memory as in CPU memory.
The solution, naturally, is to put all the data you want transferred into buffers, including the sub-arrays. One way to do this without using excessive amounts of buffers would be to pack the sub-arrays together into one and storing indices into it instead of pointers to memory.
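A rough host-side sketch of that packing approach (names and sizes are illustrative, and the actual OpenCL buffer creation is only indicated in comments): flatten all the sub-arrays into one contiguous array and store each element's starting offset instead of a pointer.

#include <vector>

struct APacked {
    double a;
    int c;
    int offset;                           // index of this element's sub-array inside 'packed'
};

int main() {
    const int N = 50, SUB = 5;
    std::vector<APacked> str(N);
    std::vector<double> packed(N * SUB);  // all sub-arrays, stored back to back

    for (int i = 0; i < N; i++) {
        str[i].offset = i * SUB;
        packed[str[i].offset + 0] = 50;   // was: str[i].array[0] = 50;
    }

    // Create two buffers: one over str.data() (N * sizeof(APacked) bytes) and one
    // over packed.data() (N * SUB * sizeof(double) bytes). In the kernel, index the
    // packed buffer with str[id].offset instead of dereferencing a pointer.
    return 0;
}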
I wrote a CUDA kernel like this:
__global__ void mykernel(int size, int *h){
    double *x[size];
    for(int i = 0; i < size; i++){
        x[i] = new double[2];
    }
    h[0] = 20;
}

int main(){
    int size = 2.5 * 100000; // or 10,000
    int *h = new int[size];
    int *u;
    size_t sizee = size * sizeof(int);
    cudaMalloc(&u, sizee);
    mykernel<<<size, 1>>>(size, u);
    cudaMemcpy(h, u, sizee, cudaMemcpyDeviceToHost);
    cout << h[0];
}
I have some other code in the kernel too but I have commented it out. The code above it also allocates some more memory.
Now when I run this with size = 2.5*10^5, I get the h[0] value to be 0.
When I run this with size = 100*100, I get the h[0] value to be 20.
So I am guessing that my kernels are crashing because I am running out of memory. I am using a Tesla C2075 card, which has 2GB of RAM! I even tried this after shutting down the X server. What I am working on is not even 100MB of data.
How can I allocate more memory to each block?
Now when I run this with size = 2.5*10^5, I get the h[0] value to be 0.
When I run this with size = 100*100, I get the h[0] value to be 20.
In your kernel launch, you are using this size variable also:
mykernel<<<size, 1>>>(size, u);
^^^^
On a cc2.0 device (Tesla C2075), this particular parameter in the 1D case is limited to 65535. So 2.5*10^5 exceeds 65535, but 100*100 does not. Therefore, your kernel may be running if you specify size of 100*100, but is probably not running if you specify size of 2.5*10^5.
As already suggested to you, proper CUDA error checking should point this error out to you, and in general it will probably result in you needing to ask far fewer questions on SO, as well as posting higher-quality ones. Take advantage of the CUDA runtime's ability to let you know when things have gone wrong and when you are making a mistake. Then you won't be in a quandary, thinking you have a memory allocation problem when in fact you probably have a kernel launch configuration problem.
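For reference, one common way to do that checking is to wrap every runtime call, and to check both cudaGetLastError() and cudaDeviceSynchronize() after a kernel launch. This is just one of many equivalent patterns, and the macro name is arbitrary:

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc(&u, sizee));
// mykernel<<<size, 1>>>(size, u);
// CUDA_CHECK(cudaGetLastError());        // catches an invalid launch configuration
// CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during kernel execution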
How can I allocate more memory to each block?
Although it is probably not your main issue (as indicated above), in-kernel new and malloc are limited to the size of the device heap. Once this has been exhausted, further calls to new or malloc will return a null pointer. If you use this null pointer anyway, your kernel code will begin to perform unspecified behavior, and will likely crash.
When using new and malloc, especially when you're having trouble, it's good practice to check for a null return value. This applies to both host (at least for malloc) and device code.
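A minimal sketch of such a check inside a kernel (illustrative only; it signals the failure back through an output value instead of dereferencing the null pointer):

__global__ void allocKernel(int *h)
{
    double *p = new double[2];
    if (p == NULL) {     // device heap exhausted: in-kernel new returned null
        h[0] = -1;       // report the failure to the host
        return;
    }
    p[0] = 0.0;
    h[0] = 20;
    delete [] p;
}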
The size of the device heap is pretty small to begin with (8MB), but it can be modified.
Referring to the documentation:
The device memory heap has a fixed size that must be specified before any program using malloc() or free() is loaded into the context. A default heap of eight megabytes is allocated if any program uses malloc() without explicitly specifying the heap size.
The following API functions get and set the heap size:
• cudaDeviceGetLimit(size_t* size, cudaLimitMallocHeapSize)
• cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)
The heap size granted will be at least size bytes. cuCtxGetLimit() and cudaDeviceGetLimit() return the currently requested heap size.
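A minimal host-side sketch of using those calls (the 128MB figure is an arbitrary example; the limit must be set before launching any kernel that uses in-kernel malloc/new):

#include <cstdio>

int main() {
    // Raise the device heap limit before the first kernel that calls malloc()/new.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);

    size_t heapSize = 0;
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("device heap size: %zu bytes\n", heapSize);

    // ... then launch the kernels that allocate with new/malloc, e.g.
    // mykernel<<<grid, 1>>>(size, u);
    return 0;
}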
I'm trying to write an application in CUDA which uses global memory defined with __device__.
These variables are declared in a .cuh file.
In another .cu file is my main, in which I do the cudaMalloc and cudaMemcpy calls.
That's a part of my code:
cudaMalloc((void**)&varOne,*tam_varOne * sizeof(cuComplex));
cudaMemcpy(varOne,C_varOne,*tam_varOne * sizeof(cuComplex),cudaMemcpyHostToDevice);
varOne is declared in the .cuh file like this:
__device__ cuComplex *varOne;
When I launch my kernel (I'm not passing varOne as a parameter) and try to read varOne with the debugger, it says that it can't read the variable. The pointer address is 000..0, so it is obviously wrong.
So, how do I have to declare and copy global memory in CUDA?
First, you need to declare the pointers to the data that will be copied from the CPU to the GPU. In the example below, we want to copy the array original_cpu_array to CUDA global memory.
int original_cpu_array[array_size];
int *array_cuda;
Calculate the memory size that the data will occupy.
int size = array_size * sizeof(int);
CUDA memory allocation:
msg_erro[0] = cudaMalloc((void **)&array_cuda,size);
Copying from CPU to GPU:
msg_erro[0] = cudaMemcpy(array_cuda, original_cpu_array,size,cudaMemcpyHostToDevice);
Execute kernel
Copying from GPU to CPU:
msg_erro[0] = cudaMemcpy(original_cpu_array,array_cuda,size,cudaMemcpyDeviceToHost);
Free Memory:
cudaFree(array_cuda);
For debugging reasons, I typically save the status of the function calls in an array (e.g., cudaError_t msg_erro[var];). This is not strictly necessary, but it will save you some time if an error occurs during the allocations and memory transfers.
And if errors do occur, I print them using a function like:
void printErros(cudaError_t *erros, int size, int flag)
{
    for(int i = 0; i < size; i++)
        if(erros[i] != 0)
        {
            if(flag == 0) printf("Memory allocation ");
            if(flag == 1) printf("CPU -> GPU ");
            if(flag == 2) printf("GPU -> CPU ");
            printf("{%d} => %s\n", i, cudaGetErrorString(erros[i]));
        }
}
The flag is primarily to indicate the part of the code where the error occurred. For instance, after a memory allocation:
msg_erro[0] = cudaMalloc((void **)&array_cuda,size);
printErros(msg_erro,msg_erro_size, 0);
I have experimented with some examples and found that you cannot directly use the global variable in the kernel without passing it to the kernel. Even though you declare it in the .cuh file, you need to initialize it in main().
Reason:
If you declare it globally, the memory is not allocated in GPU global memory. You need to use cudaMalloc((void**)&varOne, sizeof(cuComplex)) to allocate the memory; only that allocates memory on the GPU. The declaration __device__ cuComplex *varOne; works just as a prototype and variable declaration, but the memory is not allocated until cudaMalloc((void**)&varOne, sizeof(cuComplex)) is used.
Also, you need to treat *varOne in main() as a host pointer initially; after cudaMalloc() is used, it is a device pointer.
The sequence of steps is as follows (for my tested code):
int *Ad; //If you can allocate this in the .cuh file, you don't need the code shown in main()

__global__ void Kernel(int *Ad){
    ....
}

int main(){
    ....
    int size = 100*sizeof(int);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    ....
}
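For the specific case in the question, where the pointer itself is declared with __device__ in the .cuh file, one further option (a sketch using the question's names; d_tmp is mine) is to allocate with cudaMalloc through a temporary host-side pointer and then copy that device address into the symbol with cudaMemcpyToSymbol, so kernels can use varOne directly without receiving it as a parameter:

__device__ cuComplex *varOne;   // as declared in the .cuh file

// Host code sketch:
cuComplex *d_tmp;
cudaMalloc(&d_tmp, *tam_varOne * sizeof(cuComplex));
cudaMemcpy(d_tmp, C_varOne, *tam_varOne * sizeof(cuComplex), cudaMemcpyHostToDevice);
cudaMemcpyToSymbol(varOne, &d_tmp, sizeof(cuComplex *));   // store the device address in the __device__ pointer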