cuda program kernel code in device memory space - c++

Is there any way to find out, how much memory occupies the kernel code (execution) in gpu (device) memory?
If I have 512 MB device memory how can I know how much is available for allocation?
Could visual profiler show such info?

Program code uses very little memory. The rest of the CUDA context (local memory, constant memory, printf buffers, heap and stack) uses a lot more. The CUDA runtime API includes the cudaMemGetInfo call, which returns the amount of free memory available to your code. Note that because of fragmentation and page size constraints, you won't be able to allocate every last free byte of memory. The best strategy is to start with the maximum and attempt successively smaller allocations until one succeeds.
You can see a fuller explanation of device memory consumption in my answer to an earlier question along similar lines.
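The "start with the maximum and shrink" strategy can be sketched as follows. This is only an illustration: TryAlloc is a hypothetical callback standing in for a real allocation attempt (on a GPU it could wrap cudaMalloc followed by cudaFree), and largest_allocatable and the step size are assumptions, not part of the CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical callback: returns true if an allocation of `bytes` succeeds.
// On a GPU it could wrap cudaMalloc followed by cudaFree (wiring not shown).
using TryAlloc = std::function<bool(std::size_t)>;

// Start at `upper` (for example the free byte count reported by
// cudaMemGetInfo) and shrink the request by `step` until one attempt works.
// Returns the first size that succeeded, or 0 if no request of at least
// `step` bytes could be satisfied.
std::size_t largest_allocatable(std::size_t upper, std::size_t step,
                                const TryAlloc& try_alloc) {
    assert(step > 0);
    for (std::size_t size = upper; size >= step; size -= step) {
        if (try_alloc(size)) {
            return size;
        }
    }
    return 0;
}
```

A smaller step gives a tighter answer at the cost of more failed attempts. The result is only a snapshot, since other allocations can change what fits a moment later.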

Related

Does memory fragmentation slow down New/Malloc?

Short background:
I'm developing a system that should run for months and using dynamic allocations.
The question:
I've heard that memory fragmentation slows down new and malloc operators because they need to "find" a place in one of the "holes" I've left in the memory instead of simply "going forward" in the heap.
I've read the following question:
What is memory fragmentation?
But none of the answers mentioned anything regarding performance, only failure allocating large memory chunks.
So does memory fragmentation make new take more time to allocate memory?
If yes, by how much? How do I know if new is having a "hard time" finding memory on the heap?
I've tried to find out which data structures/algorithms GCC uses to find a "hole" in memory to allocate from, but couldn't find any decent explanation.
Memory allocation is platform specific.
I would say yes, new takes time to allocate memory. How much time depends on many factors, such as the algorithm, level of fragmentation, processor speed, optimizations, etc.
The best answer for how much time is taken, is to profile and measure. Write a simple program that fragments the memory, then measure the time for allocating memory.
There is no direct method for a program to find out the difficulty of finding available memory locations. You may be able to read a clock, allocate memory, then read again. Another idea is to set a timer.
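A minimal sketch of the "fragment, then read a clock around the allocation" idea, assuming plain malloc/free and std::chrono::steady_clock. The helper names fragment_heap and time_allocations are hypothetical, and the numbers you get depend entirely on your allocator and platform.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Allocate `count` blocks of `block_size` bytes, then free every other one
// to leave holes in the heap. Returns the blocks that are still allocated;
// the caller must free them.
std::vector<void*> fragment_heap(std::size_t count, std::size_t block_size) {
    std::vector<void*> blocks;
    blocks.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        blocks.push_back(std::malloc(block_size));
    }
    std::vector<void*> kept;
    for (std::size_t i = 0; i < count; ++i) {
        if (i % 2 == 0) {
            std::free(blocks[i]);
        } else {
            kept.push_back(blocks[i]);
        }
    }
    return kept;
}

// Read a clock, run `rounds` malloc/free pairs of `size` bytes, read the
// clock again. Returns the elapsed time in nanoseconds.
long long time_allocations(std::size_t rounds, std::size_t size) {
    assert(rounds > 0);
    void* volatile sink = nullptr;  // keeps the compiler from dropping the pair
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < rounds; ++i) {
        void* p = std::malloc(size);
        sink = p;
        std::free(p);
    }
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling time_allocations once before and once after fragment_heap gives a rough comparison on the target platform. A single run is noisy, so repeat the measurement before drawing conclusions.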
Note: in many embedded systems, dynamic memory allocation is frowned upon. In critical systems, fragmentation can be the enemy, so fixed-size arrays are used. Fixed-size memory allocations (at compile time) remove fragmentation as a source of defects.
Edit 1: The Search
Usually, memory allocation requires a call to a function. The impact of this is that the processor may have to reload its instruction cache or pipeline, consuming extra processing time. There may also be extra instructions for passing parameters such as the minimal size. Local variables and allocations at compile time usually don't need a function call for allocation.
Unless the allocation algorithm is linear (think array access), it will require steps to find an available slot. Some memory management algorithms use different strategies based on the requested size. For example, some memory managers may have separate pools for sizes of 64-bits or smaller.
If you think of a memory manager as having a linked list of blocks, the manager will need to find the first block greater than or equal in size to the request. If the block is larger than the requested size, it may be split and the left over memory is then created into a new block and added to the list.
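The linked-list search and split described above can be sketched like this. It is a toy first-fit model that tracks offsets instead of real memory; FirstFitList, FreeBlock and kNoFit are hypothetical names, and real allocators are considerably more elaborate (for example, they merge adjacent free blocks).

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Sentinel returned when no free block is large enough.
const std::size_t kNoFit = static_cast<std::size_t>(-1);

struct FreeBlock {
    std::size_t offset;
    std::size_t size;
};

class FirstFitList {
public:
    explicit FirstFitList(std::size_t capacity) {
        free_.push_back({0, capacity});
    }

    // Walk the list and take the first block that is large enough.
    // Returns the offset of the allocation, or kNoFit if nothing fits.
    std::size_t allocate(std::size_t request) {
        assert(request > 0);
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->size < request) {
                continue;  // too small, keep searching
            }
            std::size_t offset = it->offset;
            if (it->size == request) {
                free_.erase(it);  // exact fit: block leaves the list
            } else {
                it->offset += request;  // split: the remainder stays listed
                it->size -= request;
            }
            return offset;
        }
        return kNoFit;
    }

    // Return a block to the end of the list (no merging of neighbours).
    void release(std::size_t offset, std::size_t size) {
        free_.push_back({offset, size});
    }

    std::size_t free_block_count() const { return free_.size(); }

private:
    std::list<FreeBlock> free_;
};
```

The more holes the list contains, the more blocks allocate may have to inspect before it finds a fit, which is the search cost the question asks about.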
There is no standard algorithm for memory management. They differ based on the needs of the system. Memory managers for platforms with restricted (small) sizes of memory will be different than those that have large amounts of memory. Memory allocation for critical systems may be different than those for non-critical systems. The C++ standard does not mandate the behavior of a memory manager, only some requirements. For example, the memory manager is allowed to allocate from a hard drive, or a network device.
The significance of the impact depends on the memory allocation algorithm. The best path is to measure the performance on your target platform.

How to calculate the largest possible float array I can copy to GPU? [duplicate]

I'm writing a server process that performs calculations on a GPU using CUDA. I want to queue up incoming requests until enough memory is available on the device to run the job, but I'm having a hard time figuring out how much memory I can allocate on the device. I have a pretty good estimate of how much memory a job requires (at least how much will be allocated from cudaMalloc()), but I get device out of memory long before I've allocated the total amount of global memory available.
Is there some kind of formula to compute, from the total global memory, the amount I can allocate? I can play with it until I get an estimate that works empirically, but I'm concerned my customers will deploy different cards at some point and my jerry-rigged numbers won't work very well.
The size of your GPU's DRAM is an upper bound on the amount of memory you can allocate through cudaMalloc, but there's no guarantee that the CUDA runtime can satisfy a request for all of it in a single large allocation, or even a series of small allocations.
The constraints of memory allocation vary depending on the details of the underlying driver model of the operating system. For example, if the GPU in question is the primary display device, then it's possible that the OS has also reserved some portion of the GPU's memory for graphics. Other implicit state the runtime uses (such as the heap) also consumes memory resources. It's also possible that the memory has become fragmented and no contiguous block large enough to satisfy the request exists.
The CUDART API function cudaMemGetInfo reports the free and total amount of memory available. As far as I know, there's no similar API call which can report the size of the largest satisfiable allocation request.
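Since there is no formula, one option is to combine the reported free memory with a safety margin when deciding which queued jobs to start. The sketch below is an assumption-laden illustration, not a CUDA facility: FreeBytesProbe is a hypothetical callback that in real code could wrap cudaMemGetInfo(&free_bytes, &total_bytes), and Job, admit_jobs and headroom_percent are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical probe returning the bytes currently free on the device.
// With the CUDA runtime it could wrap cudaMemGetInfo(&free_bytes, &total_bytes).
using FreeBytesProbe = std::function<std::size_t()>;

struct Job {
    int id;
    std::size_t estimated_bytes;  // caller's estimate of cudaMalloc usage
};

// Query free memory once, hold back `headroom_percent` of it as a margin for
// fragmentation and runtime overhead, then admit jobs from the front of the
// queue while their estimates fit in what is left. Returns the admitted ids.
std::vector<int> admit_jobs(std::deque<Job>& queue,
                            const FreeBytesProbe& free_bytes,
                            unsigned headroom_percent) {
    assert(headroom_percent < 100);
    std::size_t budget = free_bytes() / 100 * (100 - headroom_percent);
    std::vector<int> admitted;
    while (!queue.empty() && queue.front().estimated_bytes <= budget) {
        budget -= queue.front().estimated_bytes;
        admitted.push_back(queue.front().id);
        queue.pop_front();
    }
    return admitted;
}
```

The margin is a tunable guess rather than a guarantee, so an admitted job should still check the result of every allocation and be re-queued if one fails.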

Memory stability of a C++ application in Linux

I want to verify the memory stability of a C++ application I wrote and compiled for Linux.
It is a network application that responds to remote clients connecting at a rate of 10-20 connections per second.
In the long run, memory rose to 50MB, even though the app was making calls to delete...
Investigation shows that Linux does not immediately free memory. So here are my questions:
How can I force Linux to free memory I have actually freed? At least I want to do this once to verify memory stability.
Otherwise, is there any reliable memory indicator that can report memory my app is actually holding?
What you are seeing is most likely not a memory leak at all. Operating systems and malloc/new heaps both do very complex accounting of memory these days. This is, in general, a very good thing. Chances are any attempt on your part to force the OS to free the memory will only hurt both your application performance and overall system performance.
To illustrate:
1. The Heap reserves several areas of virtual memory for use. None of it is actually committed (backed by physical memory) until malloc'd.
2. You allocate memory. The Heap grows accordingly. You see this in task manager.
3. You allocate more memory on the Heap. It grows more.
4. You free the memory allocated in step 2. The Heap cannot shrink, however, because the memory from step 3 is still allocated, and Heaps are unable to compact memory (it would invalidate your pointers).
5. You malloc/new more stuff. This may get tacked on after the memory allocated in step 3, because it cannot fit in the area left open by freeing step 2's memory, or because it would be inefficient for the Heap manager to scour the heap for the block left open by step 2 (this depends on the Heap implementation and the chunk size of the memory being allocated/freed).
So is that memory at step #2 now dead to the world? Not necessarily. For one thing, it will probably get reused eventually, once it becomes efficient to do so. In cases where it isn't reused, the Operating System itself may be able to use the CPU's Virtual Memory features (the TLB) to "remap" the unused memory right out from under your application, and assign it to another application -- on the fly. The Heap is aware of this and usually manages things in a way to help improve the OS's ability to remap pages.
These are valuable memory management techniques that have the unmitigated side effect of rendering fine-grained memory-leak detection via Process Explorer mostly useless. If you want to detect small memory leaks in the heap, then you'll need to use runtime heap leak-detection tools. Since you mentioned that you're able to build on Windows as well, I will note that Microsoft's CRT has adequate leak-checking tools built-in. Instructions for use found here:
http://msdn.microsoft.com/en-us/library/974tc9t1(v=vs.100).aspx
There are also open-source replacements for malloc available for use with GCC/Clang toolchains, though I have no direct experience with them. I think on Linux Valgrind is the preferred and more reliable method for leak-detection anyway. (and in my experience easier to use than MSVCRT Debug).
I would suggest using Valgrind with the Memcheck tool, or any other profiling tool, to look for memory leaks.
from Valgrind's page:
Memcheck
detects memory-management problems, and is aimed primarily at
C and C++ programs. When a program is run under Memcheck's
supervision, all reads and writes of memory are checked, and calls to
malloc/new/free/delete are intercepted. As a result, Memcheck can
detect if your program:
Accesses memory it shouldn't (areas not yet allocated, areas that have been freed, areas past the end of heap blocks, inaccessible areas
of the stack).
Uses uninitialised values in dangerous ways.
Leaks memory.
Does bad frees of heap blocks (double frees, mismatched frees).
Passes overlapping source and destination memory blocks to memcpy() and related functions.
Memcheck reports these errors as soon as they occur, giving the source
line number at which it occurred, and also a stack trace of the
functions called to reach that line. Memcheck tracks addressability at
the byte-level, and initialisation of values at the bit-level. As a
result, it can detect the use of single uninitialised bits, and does
not report spurious errors on bitfield operations. Memcheck runs
programs about 10-30x slower than normal.
Massif
Massif is a heap profiler. It performs detailed heap profiling by
taking regular snapshots of a program's heap. It produces a graph
showing heap usage over time, including information about which parts
of the program are responsible for the most memory allocations. The
graph is supplemented by a text or HTML file that includes more
information for determining where the most memory is being allocated.
Massif runs programs about 20x slower than normal.
Using Valgrind is as simple as passing your application, together with its usual switches, as arguments to valgrind:
valgrind --tool=memcheck ./myapplication -f foo -b bar
I very much doubt that anything beyond wrapping malloc and free [or new and delete ] with another function can actually get you anything other than very rough estimates.
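A minimal sketch of such a wrapper, assuming every allocation in the application goes through it. tracked_malloc, tracked_free and live_bytes are hypothetical names; the size is stored in a small header in front of each block so the free path can subtract it again.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Bytes handed out through tracked_malloc and not yet returned.
std::atomic<std::size_t> g_live_bytes{0};

// Header stored in front of each block. The union pads it to the maximum
// fundamental alignment so the pointer given to the caller stays aligned.
union Header {
    std::size_t size;
    std::max_align_t align;
};

void* tracked_malloc(std::size_t size) {
    void* raw = std::malloc(sizeof(Header) + size);
    if (raw == nullptr) {
        return nullptr;
    }
    static_cast<Header*>(raw)->size = size;
    g_live_bytes += size;
    return static_cast<Header*>(raw) + 1;  // user memory starts after the header
}

void tracked_free(void* ptr) {
    if (ptr == nullptr) {
        return;
    }
    Header* header = static_cast<Header*>(ptr) - 1;
    assert(g_live_bytes.load() >= header->size);  // counter must not underflow
    g_live_bytes -= header->size;
    std::free(header);
}

// What the application itself is holding, independent of what the OS reports.
std::size_t live_bytes() { return g_live_bytes.load(); }
```

If live_bytes() returns to its baseline after load drops while the process size stays high, the growth comes from the heap retaining memory rather than from a leak in your code.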
One of the problems is that the memory that is freed can only be released if there is a long contiguous chunk of memory. What typically happens is that there are "little bits" of memory that are used all over the heap, and you can't find a large chunk that can be freed.
It's highly unlikely that you will be able to fix this in any simple way.
And by the way, your application is probably going to need those 50MB later on when you have more load again, so it's just wasted effort to free it.
(If the memory that you are not using is needed for something else, it will get swapped out, and pages that aren't touched for a long time are prime candidates, so if the system runs low on memory for some other tasks, it will still reuse the RAM in your machine for that space, so it's not sitting there wasted - it's just you can't use 'ps' or some such to figure out how much ram your program uses!)
As suggested in a comment: you can also write your own memory allocator, using mmap() to create a "chunk" to dole out portions from. If you have a section of code that does a lot of memory allocations, ALL of which will definitely be freed later, you can allocate them all from a separate lump of memory. When it has all been freed, you can put the mmap'd region back into a "free mmap list", and when the list is sufficiently large, free up some of the mmap allocations [this is an attempt to avoid calling mmap LOTS of times and then munmap again a few milliseconds later]. However, if you EVER let one of those memory allocations "escape" from your fenced-in area, your application will probably crash (or worse, not crash but use memory belonging to some other part of the application, so that you get a very strange result somewhere, such as one user getting to see the network content meant for another user!).
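A simplified sketch of the mmap-backed "chunk", assuming Linux (MAP_ANONYMOUS) and a hypothetical class name MmapArena. It only covers the bump allocation and the all-at-once release; the "free mmap list" cache described above is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Bump allocator over one anonymous mmap region. Individual allocations are
// never freed; the whole region goes back to the OS in the destructor.
class MmapArena {
public:
    explicit MmapArena(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
        void* p = mmap(nullptr, capacity, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        base_ = (p == MAP_FAILED) ? nullptr : static_cast<char*>(p);
    }

    ~MmapArena() {
        if (base_ != nullptr) {
            munmap(base_, capacity_);  // every pointer handed out is now invalid
        }
    }

    MmapArena(const MmapArena&) = delete;
    MmapArena& operator=(const MmapArena&) = delete;

    // Hand out the next `size` bytes, rounded up to a multiple of 16.
    // Returns nullptr when the region is exhausted or mmap failed.
    void* allocate(std::size_t size) {
        if (base_ == nullptr || size > capacity_ - used_) {
            return nullptr;
        }
        std::size_t rounded = (size + 15) & ~static_cast<std::size_t>(15);
        if (rounded > capacity_ - used_) {
            return nullptr;
        }
        void* p = base_ + used_;
        used_ += rounded;
        return p;
    }

    std::size_t used() const { return used_; }

private:
    char* base_ = nullptr;
    std::size_t capacity_;
    std::size_t used_ = 0;
};
```

Creating one arena per connection and destroying it when the connection closes returns that memory to the OS in one step, with the caveat already stated: no pointer from the arena may outlive it.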
Use Valgrind to find memory leaks: valgrind ./your_application
It will list where you allocated memory and did not free it.
I don't think it's a Linux problem, but a problem in your application. If you monitor memory usage with "top", you won't get very precise figures. Try Massif (a Valgrind tool) to see the real memory usage: valgrind --tool=massif ./your_application
As a more general rule to avoid leaks in C++: use smart pointers instead of raw pointers.
Also, in many situations you can use RAII (http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization) instead of allocating memory with "new" and freeing it by hand.
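A short sketch of both suggestions, using a hypothetical Connection type. The static counter exists only to make destruction observable; it is not something a real application needs.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical per-connection state.
struct Connection {
    static int live;           // number of Connection objects currently alive
    std::vector<char> buffer;  // RAII member: releases its storage by itself

    explicit Connection(std::size_t buffer_size) : buffer(buffer_size) { ++live; }
    ~Connection() { --live; }
};
int Connection::live = 0;

// Handles one connection. The unique_ptr deletes the Connection on every
// return path, including early returns and exceptions, so no explicit
// delete is needed.
std::size_t handle_connection(std::size_t buffer_size) {
    std::unique_ptr<Connection> conn(new Connection(buffer_size));
    assert(Connection::live > 0);
    return conn->buffer.size();
}
```

After handle_connection returns, Connection::live is back at its previous value, which is the property that makes leaks of this kind structurally impossible.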
It is not typical for an OS to release memory when you call free or delete. This memory goes back to the heap manager in the runtime library.
If you want to actually release memory, you can use brk. But that opens up a very large can of memory-management worms. If you directly call brk, you had better not call malloc. For C++, you can override new to use brk directly.
Not an easy task.
The latest dlmalloc() has a concept called an mspace (others call it a region). You can call malloc() and free() against an mspace. Or you can delete the mspace to free all memory allocated from the mspace at once. Deleting an mspace will free memory from the process.
If you create an mspace with a connection, allocate all memory for the connection from that mspace, and delete the mspace when the connection closes, you would have no process growth.
If you have a pointer in one mspace pointing to memory in another mspace, and you delete the second mspace, then as the language lawyers say "the results are undefined".

What will the OS do on Windows when I use operator new to allocate 1 KB of memory?

It seems that a memory leak occurs in my code, so I am trying to locate the place that causes it.
In the post
Can't obtain accurate information of available memory in the heap
I was told that the OS may allocate a large block of memory when a small amount is requested, in order to reduce the number of system calls.
Is that correct on Windows?
What's relevant here, after seeing your other question, is not what happens when you allocate memory. What matters is what happens when you release it. In particular a 1 KB allocation will never be released back to the OS, it is too small. It gets added to a list of free blocks, ready to be used by the next allocation of (about) the same size.
You cannot reliably detect memory leaks with VirtualQuery().
If you use Visual Studio then use its built-in leak detection feature. There are plenty of other tools.
On most systems (including most recent compilers on Windows), the heap manager will allocate relatively large "chunks" of memory from the OS, then divide that up into pieces for use by the program. That allocation from the OS will typically be at least tens of kilobytes.
Those large chunks of memory will be returned to the OS when the program ends execution. It can happen sooner than that, but end of execution is the most common.
Each of those large chunks will be tracked by the OS as a single allocation (even though the heap manager will then break it up into smaller pieces for use by your code). Any that have been released back to the OS will show up as free memory blocks.

libGL heap usage

I am working on a Linux-based C++ OpenGL application, using the Nvidia 290.10 64-bit drivers. I am trying to reduce its memory footprint, as it makes use of quite a lot of live data.
I've been using valgrind/massif to analyze heap usage, and while it helped me optimize various things, by now the largest chunk of heap memory used is allocated by libGL. No matter how I set the threshold, massif doesn't let me see in detail where those allocations come from, just that it's libGL. At peak times, I see about 250MB allocated by libGL (out of 900MB total heap usage). I hold a similar amount of memory on the graphics card, as VBOs and Textures (mostly one big 4096*4096 texture).
So it appears as if a similar amount of memory as what I upload to GPU memory is allocated on the heap by libGL. The libGL allocations also peak when the volume of VBOs peaks. Is that normal? I thought one of the benefits of having a lot of GPU memory is that it keeps the RAM free?
What you experience is perfectly normal, because an OpenGL implementation must keep a copy of the data in system memory for various reasons.
In OpenGL there's no exclusive access to the GPU, so depending on its use, it may become necessary to swap out data (or just release some objects from GPU memory). Also, GPUs may crash, and drivers then just silently reset them without the user noticing. This too requires a full copy of all the buffer data.
And don't forget that there's a major difference between address space allocation (the value reported by Valgrind) and actual memory utilization.