Apparent CUDA magic - C++

I'm using CUDA (in reality PyCUDA, if the difference matters) and performing some computation over arrays. I'm launching a kernel with a grid of 320*600 threads. Inside the kernel I'm declaring two linear arrays of 20000 components each, using:
float test[20000];
float test2[20000];
With these arrays I perform simple calculations, for example filling them with constant values. The point is that the kernel executes normally and performs the computations correctly (you can verify this by filling an output array with a random component of test and copying that array from device to host).
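In code, the kernel body looks roughly like this (a minimal sketch of the setup described above, not my actual kernel; the fill values and the component written back are placeholders):

__global__ void fill_local(float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float test[20000];   // per-thread array
    float test2[20000];  // per-thread array

    for (int i = 0; i < 20000; ++i) {
        test[i]  = 1.0f;  // fill with constant values
        test2[i] = 2.0f;
    }

    // write one arbitrary component back so the result is observable on the host
    out[tid] = test[123] + test2[456];
}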
The problem is that my NVIDIA card has only 2GB of memory, while the total amount of memory needed to allocate test and test2 would be 320*600*20000*4 bytes per array, which is much more than 2GB.
Where is this memory coming from, and how can CUDA perform the computation in every thread?
Thank you for your time

The actual sizing of the local/stack memory requirement is not what you suppose (the entire grid, all at once) but is based on a formula described by njuffa here.
Basically, the local/stack memory requirement is sized based on the maximum instantaneous capacity of the device you are running on, rather than on the size of the grid.
Based on the information provided by njuffa, the available stack size limit (per thread) is the lesser of:
The maximum local memory size (512KB for cc2.x and higher)
available GPU memory/(#of SMs)/(max threads per SM)
For your first case:
float test[20000];
float test2[20000];
That total is 160KB (per thread) so we are under the maximum limit of 512KB per thread. What about the 2nd limit?
The GTX 650M has 2 cc 3.0 (Kepler) SMs (each Kepler SM has 192 cores). Therefore, the second limit above gives, if all the GPU memory were available:
2GB/2/2048 = 512KB
(Kepler allows a maximum of 2048 threads per multiprocessor.)
so it is the same limit in this case. But this assumes all the GPU memory is available.
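As a sanity check, this second limit can be computed at run time from the device attributes and the currently free memory. Here is a minimal sketch using the CUDA runtime API (it assumes device 0; in PyCUDA the same numbers are available through the device attributes and pycuda.driver.mem_get_info()):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    cudaSetDevice(dev);

    int numSMs = 0, maxThreadsPerSM = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
    cudaDeviceGetAttribute(&maxThreadsPerSM, cudaDevAttrMaxThreadsPerMultiProcessor, dev);

    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);

    // second limit from the formula above: available memory / #SMs / max threads per SM
    size_t perThreadLimit = freeMem / numSMs / maxThreadsPerSM;
    printf("free: %zu MB, SMs: %d, max threads/SM: %d -> ~%zu KB of local/stack per thread\n",
           freeMem >> 20, numSMs, maxThreadsPerSM, perThreadLimit >> 10);
    return 0;
}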
Since you're suggesting in the comments that this configuration fails:
float test[40000];
float test2[40000];
i.e. 320KB, I would conclude that your actual available GPU memory at the point of this bulk allocation attempt is somewhere above (160/512)*100%, i.e. above 31%, but below (320/512)*100%, i.e. below 62.5%, of 2GB. In other words, your available GPU memory at the time of this bulk allocation request for the stack frame is something less than 1.25GB.
You could try to confirm this by calling cudaMemGetInfo (pycuda.driver.mem_get_info() in PyCUDA) right before the kernel launch in question. Even though your GPU starts out with 2GB, if you are running the display from it, you are likely starting with a number closer to 1.5GB. And dynamic (e.g. cudaMalloc) and/or static (e.g. __device__) allocations that occur prior to this bulk allocation request at kernel launch will all reduce the available memory.
This is all to explain some of the specifics. The general answer to your question is that the "magic" arises due to the fact that the GPU does not necessarily allocate the stack frame and local memory for all threads in the grid, all at once. It need only allocate what is required for the maximum instantaneous capacity of the device (i.e. SMs * max threads per SM), which may be a number that is significantly less than what would be required for the whole grid.

Related

OpenCL Pipeline failed to allocate buffer with cl_mem_object_allocation_failure

I have an OpenCL pipeline that processes images/video, and it can sometimes be greedy with memory. It crashes on a cl::Buffer() allocation like this:
cl_int err = CL_SUCCESS;
cl::Buffer tmp = cl::Buffer(m_context, CL_MEM_READ_WRITE, sizeData, NULL, &err);
with error -4, CL_MEM_OBJECT_ALLOCATION_FAILURE.
This occurs at a fixed point in my pipeline when using very large images. If I just downscale the image a bit, it passes through this very memory-intensive part of the pipeline.
I have access to an NVIDIA card with 4 GB that runs out of memory at a certain point, and I also tried an AMD GPU with 2 GB, which runs out earlier.
According to this thread, there should be no need to track the current allocation because of swapping with VRAM, but it seems that my pipeline exceeds the memory of my device.
So here are my questions:
1) Are there any settings on my computer or in my pipeline that would allow more VRAM to be used?
2) Is it okay to use CL_DEVICE_GLOBAL_MEM_SIZE as the reference for the maximum size to allocate, or do I need to use CL_DEVICE_GLOBAL_MEM_SIZE - (local memory + private memory), or something like that?
According to my own memory profiler, 92% of CL_DEVICE_GLOBAL_MEM_SIZE is allocated at the crash. After resizing a bit, the profiler reports 89% for the resized image and the pipeline passes, so I assume my large image is right on the edge.
Some parts of your device's VRAM may be used for the pixel buffer, constant memory, or other uses. For AMD cards, you can set the environment variables GPU_MAX_HEAP_SIZE and GPU_MAX_ALLOC_PERCENT to use a larger part of the VRAM, though this may have unintended side-effects. Both are expressed as percentages of the physically available memory on the card.
Additionally, there is a limit on the size of each individual memory allocation. You can get the maximum size for a single allocation by querying CL_DEVICE_MAX_MEM_ALLOC_SIZE, which may be less than CL_DEVICE_GLOBAL_MEM_SIZE. For AMD cards, this size can be controlled with GPU_SINGLE_ALLOC_PERCENT.
This requires no changes to your code; simply set the variables before you call your executable:
export GPU_MAX_ALLOC_PERCENT="100"
export GPU_MAX_HEAP_SIZE="100"
export GPU_SINGLE_ALLOC_PERCENT="100"
./your_program
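Independently of those environment variables, it is worth querying both limits at run time rather than assuming all of CL_DEVICE_GLOBAL_MEM_SIZE is usable in one piece. A small sketch using the OpenCL C++ wrapper (it assumes a cl::Device named device, i.e. the one your pipeline already selects):

#include <CL/cl.hpp>
#include <iostream>

// Query how much global memory the device reports and the largest single
// allocation it will accept. A cl::Buffer larger than the latter can fail
// even when plenty of global memory is still free.
void printMemoryLimits(const cl::Device& device)
{
    cl_ulong globalMem = device.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>();
    cl_ulong maxAlloc  = device.getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>();

    std::cout << "Global memory:         " << (globalMem >> 20) << " MB\n";
    std::cout << "Max single allocation: " << (maxAlloc  >> 20) << " MB\n";
}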

Reordering of work dimensions may cause a huge performance boost, but why?

I am using OpenCL for stereo image processing on the GPU, and after porting a C++ implementation to OpenCL I was playing around with optimizations. A very simple experiment was to swap the dimensions.
Consider a simple kernel that is executed for every pixel of a two-dimensional work space (e.g. 640x480). In my case it was a Census transform.
Swapping from:
int globalU = get_global_id(0);
int globalV = get_global_id(1);
to:
int globalU = get_global_id(1);
int globalV = get_global_id(0);
while adjusting the NDRange in the same way, gave a performance boost of about 500%. Other experiments in 3D space reduced the execution time from 72 ms to 2 ms, only by reordering the dimensions.
Can anybody explain to me how this happens?
Is it just an effect of memory pipelines and cache usage?
EDIT: The image has a standard memory layout. That's why I wondered about the effect. I expected the best speed when the iteration follows the order in which the image is stored in memory, which turned out not to be the case.
After reading some of the AMD APP SDK documentation, I found some interesting details about the memory channels. That could be a reason.
When you access an element in memory, it is first loaded into the CPU's cache. The CPU does not load a single element (say, 1 byte); instead, it loads a whole cache line (for example, 64 adjacent bytes). This is because you are usually likely to access subsequent elements, so the CPU would not need to access RAM again.
This makes a huge difference: to access cache memory, an electrical signal does not even have to leave the CPU chip, while if the CPU needs to load data from RAM, the signal has to travel to a separate chip, and probably more than one signal is required, since the CPU generally needs to specify the row and the column in RAM to access part of it (read What Every Programmer Should Know About Memory for more information). In practice, a cache access may take only 0.5 ns while a RAM access costs around 100 ns.
So computer algorithms should take this into account. If you traverse all the elements of a matrix, you should traverse them so that elements located near each other in memory are accessed at approximately the same time. So if your matrix has the following layout in memory:
m_0_0, m_0_1, m_0_2, ... m_1_0, m_1_1 (first column, second column, etc.)
you should access elements in order: m_0_0, m_0_1, m_0_2 (by column)
If you use a different access order (say, by row in this case), the CPU loads part of the first column into cache when you access the first element of the first column, then part of the second column when you access the first element of the second column, and so on. By the time you have traversed the first row and come back for the second element of the first column, the first column's values are no longer in the cache, since the cache has a limited (and very small) size. Such an access pattern effectively eliminates the benefit of the cache.
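The effect is easy to reproduce on the CPU with a plain C++ sketch (the exact timings depend on the hardware, but the traversal that follows the memory layout is consistently much faster for large matrices):

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const int N = 8192;                          // N x N floats, about 256 MB
    std::vector<float> m((size_t)N * N, 1.0f);
    double sum1 = 0.0, sum2 = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)                  // memory order: consecutive elements,
        for (int j = 0; j < N; ++j)              // so each cache line is fully used
            sum1 += m[(size_t)i * N + j];
    auto t1 = std::chrono::steady_clock::now();

    for (int j = 0; j < N; ++j)                  // strided order: each access jumps N*4
        for (int i = 0; i < N; ++i)              // bytes, so cached lines are mostly wasted
            sum2 += m[(size_t)i * N + j];
    auto t2 = std::chrono::steady_clock::now();

    long long msA = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    long long msB = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    printf("memory order: %lld ms, strided: %lld ms (checksums %.0f %.0f)\n",
           msA, msB, sum1, sum2);
    return 0;
}

On a GPU the analogous concern is global memory coalescing: work-items with consecutive get_global_id(0) should touch consecutive addresses.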

how to get maximum array size fitting in to gpu memory?

I am using Thrust with CUDA 5.5 to sort an integer vector.
Sorting 100*1024*1024 ints should require about 400MB of memory, but nvidia-smi always shows "Memory-Usage 105MB / 1023MB". (My test GPU is a GTX 260M.)
Sorting 150*1024*1024 ints gives an allocation error:
terminate called after throwing an instance of 'thrust::system::detail::bad_alloc'
what(): std::bad_alloc: out of memory
Aborted (core dumped)
Before the array allocation I check memory using cudaMemGetInfo; it returns:
GPU memory usage: used = 105.273682, free = 918.038818 MB, total = 1023.312500 MB
Can I check the maximum memory available for my integer array before starting the GPU analysis?
EDIT:
OK, right before the sort my memory usage is about this:
GPU memory usage: used = 545.273682, free = 478.038818 MB, total = 1023.312500 MB
It seems to me that the sort algorithm needs some additional memory.
Thrust sorting operations require significant extra temporary storage.
nvidia-smi effectively samples memory usage at various times, and the amount of memory in use at the sampling point may not reflect the maximum memory used (or required) by your application. As you've discovered, cudaMemGetInfo may be more useful.
I've generally found thrust to be able to sort arrays up to about 40% of the memory on your GPU. However there is no specified number and you may need to determine it by trial and error.
Don't forget that CUDA uses some overhead memory, and if your GPU is hosting a display, that will consume additional memory as well.
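A minimal sketch of the kind of pre-check you can do before the sort (the ~40% figure is the rule of thumb above, not a guaranteed bound):

#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main()
{
    const size_t n = 100 * 1024 * 1024;      // number of ints to sort
    const size_t bytesNeeded = n * sizeof(int);

    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    printf("free = %zu MB, total = %zu MB\n", freeMem >> 20, totalMem >> 20);

    // thrust::sort needs significant temporary storage, so keep the input
    // well under ~40% of total device memory (trial and error may refine this).
    if (bytesNeeded > totalMem * 4 / 10) {
        printf("array of %zu MB is probably too large to sort on this GPU\n",
               bytesNeeded >> 20);
        return 1;
    }

    thrust::device_vector<int> d(n);
    // ... fill d with the values to sort ...
    thrust::sort(d.begin(), d.end());
    return 0;
}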

tile_static dynamically indexed arrays; should I even bother?

I'm going to great lengths to try and store frequently accessed data in tile_static memory to take advantage of the boundless performance nirvana which will ensue.
However, I've just read that only certain hardware/drivers can actually dynamically index tile_static arrays, and that the operation might just spill over to global memory anyway.
In an ideal world I'd just do it and profile, but this is turning out to be a major operation and I'd like to get an indication as to whether or not I'm wasting my time here:
tile_static int staticArray[128];
int resultFast = staticArray[0]; // this is super fast
// but what about this:
int i = ...; // dynamically derived value!
int resultNotSoFast = staticArray[i]; // is this faster than getting it from global memory?
How can I find out whether my GPU/driver supports dynamic indexing of static arrays?
Dynamic Indexing of Local Memory
So I did some digging on this because I wanted to understand this too.
I think you are referring to dynamic indexing of local memory, not tile_static (or, in CUDA parlance, "shared memory"). In your example above, staticArray should be declared as:
int staticArray[128]; // not tile_static
This cannot be dynamically indexed because an array like int staticArray[128] is actually stored in 128 registers, and registers cannot be dynamically addressed. Allocating large arrays like this is problematic anyway, because it uses up a large number of registers, which are a limited resource on the GPU. Use too many registers per thread and your application will be unable to exploit all of the available parallelism, because fewer threads can be resident on the GPU at the same time.
In the case of C++ AMP, I'm not even sure whether the level of abstraction provided by DX11 makes this somewhat irrelevant; I'm not enough of an expert on DX11 to know.
There's a great explanation of this here: In a CUDA kernel, how do I store an array in "local thread memory"?
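For reference, the corresponding situation in CUDA looks like the sketch below (whether the array really stays in registers is up to the compiler; you can check the per-thread local-memory usage that nvcc reports with -Xptxas -v):

__global__ void indexingExample(const int* in, int* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int staticArray[128];                 // per-thread array, not shared/tile_static
    for (int i = 0; i < 128; ++i)
        staticArray[i] = i;

    // Constant (compile-time) index: the value can live in a register.
    int resultFast = staticArray[0];

    // Dynamic index: registers are not addressable, so the array is typically
    // spilled to "local" memory, which physically resides in off-chip DRAM
    // (cached on newer hardware).
    int i = in[tid] & 127;
    int resultNotSoFast = staticArray[i];

    out[tid] = resultFast + resultNotSoFast;
}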
Bank Conflicts
Tile static memory is divided into a number of modules referred to as banks. Tile static memory typically consists of 16, 32, or 64 banks, each of which is 32 bits wide. This is specific to the particular GPU hardware and might change in the future. Tile static memory is interleaved across these banks. This means that for a GPU whose tile static memory is implemented with 32 banks, if arr is an array<float, 1>, then arr[1] and arr[33] are in the same bank because each float occupies a single 32-bit bank location. This is the key point to understand when it comes to dealing with bank conflicts.
Each bank can service one address per cycle. For best performance, threads in a warp should either access data in different banks or all read the same data in a single bank, a pattern typically optimized by the hardware. When these access patterns are followed, your application can maximize the available tile static memory bandwidth. In the worst case, multiple threads in the same warp access data from the same bank. This causes these accesses to be serialized, which might result in a significant degradation in performance.
I think the key point of confusion (based on some of your other questions) might be that a memory bank is 32 bits wide but is responsible for access to all of the memory within that bank, which will be 1/16, 1/32, or 1/64 of the total tile static memory.
You can read more about bank conflicts here: What is a bank conflict? (Doing Cuda/OpenCL programming)
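For illustration, here is a CUDA-flavoured sketch of the two access patterns described in the quote (tile_static roughly corresponds to __shared__ in CUDA; this assumes 32 banks, each 32 bits wide, and a 32x32 thread block):

#define TILE 32

__global__ void bankConflictDemo(const float* in, float* out)
{
    __shared__ float tile[TILE][TILE];    // the usual fix for the conflicting
                                          // pattern below is to pad: [TILE][TILE + 1]
    int tx = threadIdx.x, ty = threadIdx.y;
    tile[ty][tx] = in[ty * TILE + tx];
    __syncthreads();

    // Conflict-free: consecutive threads of a warp (varying tx) read consecutive
    // 32-bit words, i.e. 32 different banks.
    float a = tile[ty][tx];

    // Worst case: consecutive threads read with a stride of 32 floats, so all
    // 32 threads of the warp hit the same bank and the accesses are serialized.
    float b = tile[tx][ty];

    out[ty * TILE + tx] = a + b;
}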

Is there a limit on the size of array that can be used in CUDA?

I have written a program that calculates the integral of a simple function. When testing it I found that if I used an array with more than 10 million elements it produced the wrong answer. The error seemed to occur once the array had been manipulated in a CUDA kernel; 10 million elements and below worked fine and produced the correct result.
Is there a limit on the number of elements that can be transferred to the GPU or computed on the GPU?
P.S. I'm using C-style arrays containing floats.
There are many different kinds of memory that you can use with CUDA. In particular, you have
Linear Memory (cuMemAlloc)
Pinned Memory (cuMemAllocHost)
Zero-Copy Memory (cuMemHostAlloc with CU_MEMHOSTALLOC_DEVICEMAP)
Pitch Allocation (cuMemAllocPitch)
Textures Bound to Linear Memory
Textures Bound to CUDA Arrays
Textures Bound to Pitch Memory
...and cube maps and surfaces, which I will not list here.
Each kind of memory is associated with its own hardware resource limits, many of which you can find by using cuDeviceGetAttribute. The function cuMemGetInfo returns the amount of free and total memory on the device, but because of alignment requirements, allocating 1,000,000 floats may result in more than 1,000,000 * sizeof(float) bytes being consumed. The maximum number of blocks that you can schedule at once is also a limitation: if you exceed it, the kernel will fail to launch (you can easily find this number using cuDeviceGetAttribute). You can find out the alignment requirements for different amounts of memory using the CUDA Driver API, but for a simple program, you can make a reasonable guess and check the return value of the allocation function to determine whether the allocation succeeded.
There is no restriction on the number of bytes that you can transfer; using asynchronous functions, you can overlap kernel execution with memory copying (provided that your card supports this). If you exceed the maximum number of blocks you can schedule, or consume the available memory on your device, you will have to split up your task and use multiple kernels to handle it.
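As a concrete example of the "check the return value" advice, here is a minimal sketch using the runtime API (the driver API calls listed above behave analogously):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);

    const size_t n = 500u * 1000u * 1000u;         // 500 million floats, ~2 GB
    float* d = nullptr;
    cudaError_t err = cudaMalloc(&d, n * sizeof(float));
    if (err != cudaSuccess) {
        // The allocation failed (e.g. not enough free memory): split the work
        // into smaller chunks instead of assuming it succeeded.
        printf("cudaMalloc failed: %s (free %zu MB of %zu MB)\n",
               cudaGetErrorString(err), freeMem >> 20, totalMem >> 20);
        return 1;
    }

    // ... launch kernels that use d ...
    cudaFree(d);
    return 0;
}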
For compute capability >= 3.0, the maximum grid dimensions are 2147483647 x 65535 x 65535, so with 1024 threads per block that covers any 1-D array of up to 2147483647 x 1024 = 2.1990233e+12 elements.
1-billion-element arrays are definitely fine: 1,000,000,000 / 1024 = 976562.5, which rounds up to 976563 blocks. Just make sure that if threadIdx.x + blockIdx.x * blockDim.x >= the number of elements, you return from the kernel without processing.
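A minimal sketch of that guard for a 1-billion-element float array (the block and thread counts are just the example numbers above):

__global__ void scale(float* data, size_t n)
{
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;              // extra threads in the last block do nothing
    data[idx] *= 2.0f;
}

// host side:
//   size_t n = 1000000000;
//   int threads = 1024;
//   int blocks  = (int)((n + threads - 1) / threads);   // 976563
//   scale<<<blocks, threads>>>(d_data, n);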