I am trying to use Tensorflow for inference within my C++ application. Other parts of the application need access to large amounts of GPU memory (not at exactly the same time as Tensorflow). However, once Tensorflow has been used to perform inference, it hogs the GPU memory and does not release it until the application ends. Ideally, after inference, I would be able to free the GPU memory used by Tensorflow to allow other algorithms to use the GPU.
Has anyone else faced this problem, and did you find a solution?
Tensorflow allocates memory for the lifetime of the process. There is unfortunately no way around that; you only get the memory back once the process finishes.
One way to solve this would be to "modularize" your application into multiple distinct processes: have one process for performing inference, and a parent process (your application) which calls it. You can run the child process blocking, so your entire app behaves as if it were executing the code itself (apart from handling resource sharing, of course).
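For illustration, a minimal sketch of the parent side on Linux, assuming the inference code has been moved into a separate executable (the name "infer_worker" and its arguments are hypothetical):

```cpp
// Sketch (POSIX): run the inference step as a blocking child process so that
// all GPU memory it grabbed is returned to the driver when the child exits.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdexcept>

void run_inference_blocking(const char* input_path, const char* output_path) {
    pid_t pid = fork();
    if (pid < 0)
        throw std::runtime_error("fork failed");
    if (pid == 0) {
        // Child: replace this process image with the inference worker.
        execlp("infer_worker", "infer_worker", input_path, output_path,
               static_cast<char*>(nullptr));
        _exit(127);  // only reached if exec failed
    }
    // Parent: block until the child finishes; at that point the GPU memory
    // held by Tensorflow in the child is released by the driver.
    int status = 0;
    waitpid(pid, &status, 0);
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        throw std::runtime_error("inference worker failed");
}
```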
I am currently writing a custom PyTorch dataloader that loads the training data for machine learning from a 2GB JSON file.
The dataloader, which is basically a CPython extension module written in C++, loads the entire JSON file into a temporary data structure and converts the data into another in-memory format which is compatible with the PyTorch model I'm using.
I've managed to make the program load and convert the data at a reasonable speed thanks to the brilliant free libraries out there, but it turned out that the program consumes too much memory when I tried to scale up the training.
When PyTorch performs multi-GPU/multi-node training, the library allocates one Python process for each GPU, which means there are several separate instances of my dataloader running at the same time.
My machine has enough RAM for one dataloader to run without problems, but not enough RAM to run several of them.
I confirmed that once RAM is exhausted, the dataloaders start using up several GB of swap space, which degrades performance severely.
So I started to see where I could save some RAM space in my dataloader.
I found that the temporary data structure into which the JSON data is initially loaded is completely unnecessary once the conversion is finished, so I want to free up this memory for the other processes.
The question is, how am I supposed to do this with the standard library? The data structure basically consists of std::vectors and std::unordered_maps on the heap, but simply destroying them does not release the memory back to the OS, because Linux glibc implements no heap compaction mechanism.
On Windows I could implement a custom std::allocator that resides in a separate heap and just destroy the entire heap after use (though I'm not sure this actually works), but glibc malloc() does not take a heap handle parameter.
I don't believe this is the first time this question has been asked, and I don't think implementing a custom std::allocator on top of a third-party heap allocator is the only answer. How can I free up heap space for another process under Linux glibc? Could you give me some pointers?
Thanks in advance.
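A minimal sketch of one possible direction, assuming glibc's malloc_trim() is an option here: it asks the allocator to return free heap pages to the kernel once the temporaries are destroyed, though how much it helps depends on heap fragmentation.

```cpp
// Sketch: destroy the temporary JSON structures, then ask glibc to hand
// free heap pages back to the kernel. malloc_trim() is glibc-specific, and
// it only helps if the freed memory is not pinned down by fragmentation.
#include <malloc.h>   // malloc_trim (glibc)
#include <string>
#include <unordered_map>
#include <vector>

void convert_and_release() {
    {
        std::vector<std::string> raw_records;             // temporary JSON data
        std::unordered_map<std::string, int> raw_index;   // temporary lookup
        // ... load the JSON, build the PyTorch-compatible format ...
    }   // temporaries destroyed here; memory goes back to the glibc heap

    // Ask glibc to release free pages back to the OS (newer glibc versions
    // can also release whole free pages from the middle of the heap).
    malloc_trim(0);
}
```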
I develop a C++ framework that is used to run user code in a well-defined environment (Linux boxes under our supervision).
I would like to prevent badly written modules from eating up all the memory of a machine. Since I develop the framework, could I simply force the program to stop itself if its memory consumption is too high? What API or tool should I use for this?
A simple mechanism for controlling a process's resource limits is provided by setrlimit. But that doesn't isolate the process (which you should do for untrusted third-party code); it only puts some restrictions on it. To properly isolate a process from the rest of the system, you should either use a VM, or make use of cgroups and kernel namespaces — preferably not by hand, but via some existing library or framework (such as Docker).
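A rough sketch of the setrlimit route (this caps the calling process's own address space, so allocations beyond the limit fail instead of the machine being eaten):

```cpp
// Sketch: cap the address space of the current process with setrlimit.
// Once the limit is reached, malloc()/new start failing (std::bad_alloc)
// instead of the whole machine running out of memory.
#include <sys/resource.h>
#include <cstddef>

bool limit_address_space(std::size_t max_bytes) {
    rlimit rl{};
    rl.rlim_cur = max_bytes;   // soft limit, enforced immediately
    rl.rlim_max = max_bytes;   // hard limit; an unprivileged process cannot raise it again
    return setrlimit(RLIMIT_AS, &rl) == 0;
}

// Typical use in the framework, before any user module is loaded:
//   limit_address_space(2ull * 1024 * 1024 * 1024);   // ~2 GB
```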
How can I make my program stop if its memory consumption exceeds a limit?
When you define the interface between the application and its modules, ensure that one of the first steps (probably the first) is to pass an allocator-like class instance from the application to the modules.
This instance should be used in the module to allocate and deallocate all necessary memory.
This allows implementations of this allocator to report memory allocations to the main application, which can then trigger an exception if a limit (per module or per application) is reached.
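A minimal sketch of such an allocator-like interface (class and member names are illustrative, not taken from the answer):

```cpp
// Sketch: an allocator instance handed from the application to each module.
// The module routes all of its allocations through it, so the application can
// enforce a per-module budget and raise an exception when it is exceeded.
#include <cstddef>
#include <cstdlib>
#include <new>
#include <stdexcept>

class ModuleAllocator {
public:
    explicit ModuleAllocator(std::size_t limit_bytes) : limit_(limit_bytes) {}

    void* allocate(std::size_t bytes) {
        if (used_ + bytes > limit_)
            throw std::runtime_error("module exceeded its memory budget");
        void* p = std::malloc(bytes);
        if (!p)
            throw std::bad_alloc();
        used_ += bytes;
        return p;
    }

    void deallocate(void* p, std::size_t bytes) noexcept {
        std::free(p);
        used_ -= bytes;
    }

    std::size_t used() const noexcept { return used_; }

private:
    std::size_t limit_;
    std::size_t used_ = 0;
};
```

Of course, this only constrains modules that actually allocate through the interface.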
You can directly provide your own operator new. However, that won't protect you from calls to malloc, or direct OS calls. This would require patching or wrapping glibc (since you're on Linux). Doable but not nice.
What's your desired security level? Are you protecting against Murphy or Machiavelli? Might a plugin use a third-party library which allocates memory on behalf of the plugin? Do you need to keep track of which plugin allocated the memory?
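For illustration, a sketch of the operator new route mentioned above (this catches C++ new in the process, but not malloc() or direct OS calls):

```cpp
// Sketch: replace the global operator new/delete to enforce a process-wide
// allocation cap. malloc() and direct OS allocations are not intercepted.
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

namespace {
std::atomic<std::size_t> g_used{0};
constexpr std::size_t kLimit = 512ull * 1024 * 1024;   // example cap: 512 MB
}

void* operator new(std::size_t size) {
    if (g_used.fetch_add(size) + size > kLimit) {
        g_used.fetch_sub(size);
        throw std::bad_alloc();     // budget exceeded
    }
    void* p = std::malloc(size);
    if (!p) {
        g_used.fetch_sub(size);
        throw std::bad_alloc();
    }
    return p;
}

void operator delete(void* p, std::size_t size) noexcept {
    std::free(p);
    g_used.fetch_sub(size);         // sized deallocation (C++14) keeps the count honest
}

void operator delete(void* p) noexcept {
    std::free(p);
    // Without the size we cannot decrement g_used here; a full implementation
    // would store the allocation size in a header in front of each block.
}
```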
I'm just getting into GPU processing.
I was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
For example, you may have a small C program that performs an image filter on an index of images. Can you have that program running on each CUDA core so that it essentially runs forever - reading/writing from its own memory to system memory and disk?
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
My semantics here are probably way off. I apologize if what I've said requires some interpretation. I'm not that used to GPU stuff yet.
Thanks.
All of my comments here should be prefaced with "at the moment". Technology is constantly evolving.
was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
process is mostly a (host) operating system term. CUDA doesn't define a process separately from the host operating system definition of it, AFAIK. CUDA threadblocks, once launched on a Streaming Multiprocessor (or SM, a hardware execution resource component inside a GPU), in many cases will stay on that SM for their "lifetime", and the SM includes an array of "CUDA cores" (a bit of a loose or conceptual term). However, there is at least one documented exception today to this in the case of CUDA Dynamic Parallelism, so in the most general sense, it is not possible to "lock" a CUDA thread of execution to a CUDA core (using core here to refer to that thread of execution forever remaining on a given warp lane within a SM).
Can you have that program running on each CUDA core that essentially runs forever
You can have a CUDA program that runs essentially forever. It is a recognized programming technique sometimes referred to as persistent threads. Such a program will naturally occupy/require one or more CUDA cores (again, using the term loosely). As already stated, that may or may not imply that the program permanently occupies a particular set of physical execution resources.
reading/writing from its own memory to system memory
Yes, that's possible, extending the train of thought. Writing to its own memory is obviously possible, by definition, and writing to system memory is possible via the zero-copy mechanism (slides 21/22), given a reasonable assumption of appropriate setup activity for this mechanism.
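To make the persistent-threads plus zero-copy combination concrete, a very rough CUDA C++ sketch (the work-queue layout and field names are invented for illustration; error checking is omitted, and a GPU without a kernel-runtime watchdog is assumed):

```cpp
// Sketch: a "persistent" kernel that spins on a tiny work queue kept in
// zero-copy (host-mapped) memory until the host sets a quit flag.
#include <cuda_runtime.h>

struct WorkQueue {
    volatile int quit;       // set by the host to stop the kernel
    volatile int has_work;   // set by the host when 'input' is valid
    volatile float input;
    volatile float output;
};

__global__ void persistent_kernel(WorkQueue* q) {
    while (!q->quit) {
        if (q->has_work) {
            q->output = q->input * 2.0f;   // placeholder "processing"
            __threadfence_system();        // make the result visible to the host
            q->has_work = 0;
        }
    }
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);

    WorkQueue* h_q = nullptr;
    cudaHostAlloc(reinterpret_cast<void**>(&h_q), sizeof(WorkQueue),
                  cudaHostAllocMapped);        // host memory mapped into the GPU
    h_q->quit = 0; h_q->has_work = 0; h_q->input = 0.0f; h_q->output = 0.0f;

    WorkQueue* d_q = nullptr;
    cudaHostGetDevicePointer(reinterpret_cast<void**>(&d_q), h_q, 0);

    persistent_kernel<<<1, 1>>>(d_q);          // runs until 'quit' is set

    h_q->input = 21.0f;
    h_q->has_work = 1;                         // producer side: hand work to the GPU
    while (h_q->has_work) { /* spin until the GPU consumes it */ }

    h_q->quit = 1;                             // shut the persistent kernel down
    cudaDeviceSynchronize();
    cudaFreeHost(h_q);
    return 0;
}
```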
and disk?
No, that's not directly possible today, without host system interaction, and/or without a significant assumption of atypical external resources such as a disk controller of some sort connected via a GPUDirect interface (with a lot of additional assumptions and unspecified framework). The GPUDirect exception requires so much additional framework, that I would say, for typical usage, the answer is "no", not without host system activity/intervention. The host system (normally) owns the disk drive, not the GPU.
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
In my opinion, the CPU must still be considered. One consideration is if you need to write to disk. Even if you don't, most programs derive I/O from somewhere (e.g. MPI) and so the implication of a larger framework of some sort is there. Secondly, and relatedly, the persistent threads programming model usually implies a producer/consumer relationship, and a work queue. The GPU is on the processing side (consumer side) of the work queue, but something else (usually) is on the producer side, typically the host system CPU. Again, it could be another GPU, either locally or via MPI, that is on the producer side of the work queue, but that still usually implies an ultimate producer somewhere else (i.e. the need for system I/O).
Additionally:
Can CUDA threads send packets over a network?
This is like the disk question. These questions could be viewed in a general way, in which case the answer might be "yes". But restricting ourselves to formal definitions of what a CUDA thread can do, I believe the answer is more reasonably "no". CUDA provides no direct definitions for I/O interfaces to disk or network (or many other things, such as a display!). It's reasonable to conjecture or presume the existence of a lightweight host process that simply copies packets between a CUDA GPU and a network interface. With this presumption, the answer might be "yes" (and similarly for disk I/O). But without this presumption (and/or a related, perhaps more involved presumption of a GPUDirect framework), I think the most reasonable answer is "no". According to the CUDA programming model, there is no definition of how to access a disk or network resource directly.
I'm writing a 32-bit .NET program with a 2-stage input process:
It uses native C++ via C++/CLI to parse an indefinite number of files into corresponding SQLite databases (all with the same schema). The allocations by C++ 'new' will typically consume up to 1GB of the virtual address space (out of 2GB available; I'm aware of the 3GB extension, but that'll just delay the issue).
It uses complex SQL queries (run from C#) to merge the databases into a single database. I set the cache_size to 1GB for the merged database so that the merging part has minimal page faults.
My problem is that the cache in stage 2 does not re-use the 1GB of memory allocated by 'new' and properly released by 'delete' in stage 1. I know there's no leak because immediately after leaving stage 1, 'private bytes' drops down to a low amount like I'd expect. 'Virtual size' however remains at about the peak of what the C++ used.
This non-sharing between the C++ and SQLite cache causes me to run out of virtual address space. How can I resolve this, preferably in a fairly standards-compliant way? I really would like to release the memory allocated by C++ back to the OS.
This is not something you can control effectively from the C++ level of abstraction (in other words, you cannot know for sure whether memory that your program released to the C++ runtime is going to be released to the OS or not). Using special allocation policies and non-standard extensions to try to handle the issue probably won't work anyway, because you cannot control how the external libraries you use deal with memory (e.g. whether they have cached data).
A possible solution would be moving the C++ part to an external process that terminates once the SQLite databases have been created. Having an external process introduces some annoyance (e.g. it's a bit harder to keep "live" control over what happens), but it also opens up more possibilities, like parallel processing even if the libraries don't support multithreading, or using multiple machines over a network.
Since you're interoperating with C++/CLI, you're presumably using Microsoft's compiler.
If that's the case, then you probably want to look up _heapmin. After you exit from your "stage 1", call it, and it'll release blocks of memory held by the C++ heap manager back to the OS, if the complete block that was allocated from the OS is now free.
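In code, that boils down to a single CRT call once stage 1 is done; a sketch:

```cpp
// Sketch (Microsoft C runtime): after the native parsing stage has released
// its allocations with 'delete', ask the CRT heap to return completely free
// regions to the OS so the address space is available for SQLite's cache.
#include <malloc.h>   // _heapmin

void finish_stage_one() {
    // ... stage 1: parse the files, then delete all temporary C++ objects ...

    _heapmin();   // returns 0 on success, -1 on failure
}
```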
On Linux, we used google malloc (http://code.google.com/p/google-perftools/). It has a function to release the free memory to the OS: MallocExtension::instance()->ReleaseFreeMemory().
In theory, tcmalloc works on Windows too, but I never personally used it there.
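If you go that route, the call looks roughly like this (the header path differs between older google-perftools and newer gperftools releases):

```cpp
// Sketch: link against tcmalloc (google-perftools) and ask it to hand its
// free memory back to the OS after the large temporary data is deleted.
#include <gperftools/malloc_extension.h>   // <google/malloc_extension.h> in older versions

void release_after_stage_one() {
    // ... delete the temporary C++ data structures first ...
    MallocExtension::instance()->ReleaseFreeMemory();
}
```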
You could allocate it off the GC from C#, pin it, use it, and then allow it to return, thus freeing it and letting the GC compact it and re-use the memory.
How does one create and launch process (i.e. launch an .exe file) with RAM limitation using c++ and the win32 API?
Which error code will be returned if the process goes beyond the limit?
Job Objects are the right way to go.
As for an error code, there really isn't one. You create the process (with CreateProcess) and the job (with CreateJobObject), then associate the process with the job object (with AssignProcessToJobObject).
The parent process won't get an error message if the child allocates more than the allowed amount of memory. In fact, the limit will be enforced even if the parent process exits. If the child process tries to allocate more than the allowed amount of memory, the allocation will simply fail.
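A condensed sketch of that sequence (error handling omitted; the limit value and the suspended start are just one reasonable way to set it up):

```cpp
// Sketch: launch the child suspended, put it into a job object with a
// per-process memory limit, then let it run. Allocations beyond the limit
// simply fail inside the child; the parent gets no notification.
#include <windows.h>

bool launch_with_memory_limit(wchar_t* cmdline, SIZE_T limit_bytes) {
    HANDLE job = CreateJobObjectW(nullptr, nullptr);

    JOBOBJECT_EXTENDED_LIMIT_INFORMATION info = {};
    info.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
    info.ProcessMemoryLimit = limit_bytes;
    SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                            &info, sizeof(info));

    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi = {};
    if (!CreateProcessW(nullptr, cmdline, nullptr, nullptr, FALSE,
                        CREATE_SUSPENDED, nullptr, nullptr, &si, &pi))
        return false;

    AssignProcessToJobObject(job, pi.hProcess);   // the limit now applies
    ResumeThread(pi.hThread);                     // let the child start running

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    // 'job' is deliberately kept open for the child's lifetime.
    return true;
}
```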
You can use CreateProcess() to spawn a process.
Once you have done that, you can use SetProcessWorkingSetSize() to attempt to control how much physical memory it uses, but this is more of a really strong suggestion to the VMM than some actual edict that will cause malloc() and new to start failing.
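In code, that combination looks roughly like this (the bounds are just example values):

```cpp
// Sketch: spawn a process and hint its working-set bounds. This constrains
// physical memory residency, not how much the process may allocate.
#include <windows.h>

bool spawn_with_working_set_hint(wchar_t* cmdline) {
    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi = {};
    if (!CreateProcessW(nullptr, cmdline, nullptr, nullptr, FALSE,
                        0, nullptr, nullptr, &si, &pi))
        return false;

    // Ask the VMM to keep the child's working set between ~1 MB and ~64 MB.
    SetProcessWorkingSetSize(pi.hProcess, 1 * 1024 * 1024, 64 * 1024 * 1024);

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return true;
}
```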
There is no way to say "this process will take 4 MB of memory and after that all allocations fail". I mean, you're going to link to Win32 DLLs and you have no idea what sort of memory usage those things require. If you want your app to only use a certain amount of memory, don't allocate more than that. And don't do things that allocate memory.
Your question regarding the error code makes no sense at all.
NT Job Objects (SetInformationJobObject & JOBOBJECT_BASIC_LIMIT_INFORMATION)
To my knowledge there is no such possibility on Windows. It would be very useful to have, though, for testing and other things.
You have this on Java, as a JVM uses only a predefined amount of memory, but there it's not a feature but rather a problem ;-)
If you launch a process, you lose control over that process. Only the operating system can control its behavior (its memory footprint, for example), but even in that case I can't imagine how this could be achieved; as jeffamaphone stated, any limitation is at best a suggestion, not an instruction. A process can call external static libraries, COM instances, etc., so I don't see how this limit could be verified or enforced.