At my job we have a lot of C++ code that is callable from Node.js through native extensions. Many of the created objects contain pointers to large amounts of memory (for example, pointcloud data from a 3D camera points to a buffer that's over a megabyte in size). Things have been set up so that when the JS object is GC'ed, the underlying native object should be destroyed as well (using our own reference counter under the hood). Yesterday it turned out that this didn't work as well as we thought: we ran out of memory because we were leaking about a megabyte every second.
Unfortunately, I'm having trouble finding information about how to properly deal with these kinds of large objects in Node.js. I've found a function called napi_adjust_external_memory which lets me tell the runtime how much external memory is in use and which "will trigger global garbage collections more often than it would otherwise", but I don't know whether it can be used if no other parts of N-API are used. It's also not clear to me whether the OOM error is caused by errors in our C++ codebase, or by Node.js assuming it's using less memory than it actually is and therefore not triggering the GC for those objects.
So, in summary, my questions are as follows:
Is it possible that these objects were never collected despite the memory pressure?
Can I use napi_adjust_external_memory to trigger the GC more regularly without using other parts of the N-API?
How should I deal with large native objects in Node.js in order to ensure they don't leak?
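For reference, the pattern I have in mind looks roughly like this (PointCloud and the buffer handling are simplified stand-ins for our real classes, and napi_status error checking is omitted, so treat it as a sketch rather than our actual code):

```cpp
#include <node_api.h>
#include <cstdlib>

// Simplified stand-in for our native object holding a large buffer.
struct PointCloud {
    size_t size_bytes;
    void*  data;
};

// Finalizer registered with napi_wrap: runs when the JS object is GC'ed.
static void FinalizePointCloud(napi_env env, void* finalize_data, void* /*hint*/) {
    PointCloud* pc = static_cast<PointCloud*>(finalize_data);
    // Tell V8 the external memory is gone so its accounting stays accurate.
    int64_t adjusted = 0;
    napi_adjust_external_memory(env, -static_cast<int64_t>(pc->size_bytes), &adjusted);
    std::free(pc->data);
    delete pc;
}

// Wrap a native PointCloud in a JS object and report its external size.
napi_value WrapPointCloud(napi_env env, PointCloud* pc) {
    napi_value obj;
    napi_create_object(env, &obj);
    napi_wrap(env, obj, pc, FinalizePointCloud, nullptr, nullptr);
    int64_t adjusted = 0;
    napi_adjust_external_memory(env, static_cast<int64_t>(pc->size_bytes), &adjusted);
    return obj;
}
```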
I am currently writing a custom PyTorch dataloader that loads the training data for machine learning from a 2GB JSON file.
The dataloader, which is basically a CPython extension module written in C++, loads the entire JSON file into a temporary data structure and converts the data into another in-memory format which is compatible with the PyTorch model I'm using.
I've managed to make the program load and convert the data at a reasonable speed thanks to the brilliant free libraries out there, but it turned out that the program consumes too much memory when I tried to scale up the training.
When PyTorch performs multi-GPU/multi-node training, the library allocates one Python process for each GPU, which means there are several separate instances of my dataloader running at the same time.
My machine has enough RAM for one dataloader to run without problems, but not enough RAM to run several of them.
I could confirm that once RAM is exhausted, the dataloaders start using up several GB of swap space, which degrades performance severely.
So I started to see where I could save some RAM space in my dataloader.
I found out that the temporary data structure into which the JSON data is initially loaded is completely unnecessary once the conversion is finished, so I want to free up this memory for the other processes.
The question is, how am I supposed to do this with the standard library? The data structure basically consists of std::vectors and std::unordered_maps on the heap, but just destructing them does not free up the memory space because there is no heap compaction mechanism implemented in Linux glibc.
On Windows I could implement a custom std::allocator that resides in a separate heap and just destroy the entire heap after use (though I'm not sure this actually works), but glibc malloc() does not take a heap handle parameter.
I don't believe I'm the first to ask this question, and I don't think implementing a custom std::allocator based on a third-party heap allocator is the only answer. How can I free up some heap space for another process on Linux with glibc? Could you give me pointers?
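The closest thing I've found so far is glibc's malloc_trim(), used roughly like this after the temporary structures are destroyed (TempIndex is a simplified stand-in for my real structures, and I'm not sure this is the intended approach):

```cpp
#include <malloc.h>        // glibc-specific: malloc_trim
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in for the temporary structures the JSON is loaded into.
struct TempIndex {
    std::vector<std::string> tokens;
    std::unordered_map<std::string, std::vector<float>> features;
};

void convert_and_release(TempIndex&& tmp) {
    {
        TempIndex local = std::move(tmp);
        // ... convert `local` into the PyTorch-compatible in-memory format ...
    }   // destructors run here, but glibc may keep the freed pages in its arenas
    // Ask glibc to return whatever free arena memory it can to the kernel.
    // This only helps to the extent the freed blocks aren't interleaved with live ones.
    malloc_trim(0);
}
```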
Thanks in advance.
I am trying to come up with design candidates for a current project that I am working on. Its client interface is based on WCF Services exposing public methods and callbacks. Requests are routed all the way to C++ libraries (that use boost) that perform calculations, operations, etc.
The current scheme is based on a WCF Service talking to a separate native C++ process via IPC.
To make things a little simpler, there is a recommendation around here to go mixed-mode (i.e. to have a single .NET process which loads the native C++ layer inside it, most likely communicating to it via a very thin C++/CLI layer). The main concern is whether garbage collection or other .NET aspects would hinder the performance of the unmanaged C++ part of the process.
I started looking up the concepts of safe points and GC helper methods (e.g. KeepAlive(), etc.), but I couldn't find any direct discussion of this or benchmarks. From what I understand so far, one of the safe points is when a thread is executing unmanaged code, and in that case garbage collection does not suspend that thread (is this correct?) to perform the cleanup.
I guess the main question I have is whether there is a performance concern on the native side when running these two types of code in the same process vs. having separate processes.
If you have a thread that has never executed any managed code, it will not be frozen during .NET garbage collection.
If a thread which uses managed code is currently running in native code, the garbage collector won't freeze it, but will instead mark the thread to stop when it next reaches managed code. However, if you're thinking of a native dispatch loop that doesn't return for a long time, you may find that you're blocking the garbage collector (or leaving stuff pinned, causing slow GC and fragmentation). So I recommend keeping the threads that perform significant work in native code purely native.
Making sure that the compiler isn't silently generating MSIL for some standard C++ code (thereby making it execute as managed code) is a bit tricky. But in the end you can accomplish this with careful use of #pragma managed(push, off).
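For example, something along these lines keeps a hot function native even in a /clr translation unit (the function itself is just a stand-in for a real native work loop):

```cpp
// In a /clr translation unit: this function is forced to compile as native
// code, so a thread spending its life in here never transitions into managed code.
#include <cstddef>

#pragma managed(push, off)
double SumBuffer(const double* data, std::size_t count) {
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i)
        total += data[i];
    return total;
}
#pragma managed(pop)
```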
It is very easy to get a mixed mode application up and running; however, it can be very hard to get it working well.
I would advise thinking carefully before choosing that design - in particular about how you layer your application and the sort of lifetimes you expect for your unmanaged objects. A few thoughts from past experiences:
C++ object lifetime - by architecture.
Use C++ objects briefly in local scope then dispose of them immediately.
It sounds obvious but worth stating: C++ objects are unmanaged resources that are designed to be used as unmanaged resources. Typically they expect deterministic creation and destruction, often making extensive use of RAII. This can be very awkward to control from a managed program. The IDisposable pattern exists to try and solve this. It can work well for short lived objects but is rather tedious and difficult to get right for long lived objects. In particular, if you start making unmanaged objects members of managed classes rather than things that live in function scope only, very quickly every class in your program has to be IDisposable and suddenly managed programming becomes harder than unmanaged programming.
The GC is too aggressive.
Always worth remembering that when we talk about managed objects going out of scope we mean in the eyes of the IL compiler/runtime, not the language that you are reading the code in. If an unmanaged object is kept around as a member and a managed object is designed to delete it, things can get complicated. If your dispose pattern is not complete from top to bottom of your program the GC can get rather aggressive. Say for example you try to write a managed class which deletes an unmanaged object in its finaliser. Say the last thing you do with the managed object is access the unmanaged pointer to call a method. Then the GC may decide that the middle of that unmanaged call is a great time to collect the managed object. Suddenly your unmanaged pointer is deleted mid method call.
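A C++/CLI sketch of what that dispose-plus-KeepAlive discipline looks like (NativeThing is a made-up unmanaged class, so this is illustrative only):

```cpp
// Made-up unmanaged class; the wrapper shows the destructor/finaliser pair
// plus GC::KeepAlive guarding the native call.
class NativeThing {
public:
    void DoWork() { /* ... unmanaged work ... */ }
};

public ref class NativeThingWrapper {
    NativeThing* native_;
public:
    NativeThingWrapper() : native_(new NativeThing()) {}
    ~NativeThingWrapper() { this->!NativeThingWrapper(); }        // Dispose()
    !NativeThingWrapper() { delete native_; native_ = nullptr; }  // finaliser

    void DoWork() {
        native_->DoWork();
        // Without this, the GC may finalise *this* while DoWork() is still
        // running, because the managed wrapper is unreachable once the raw
        // pointer has been read out of it.
        System::GC::KeepAlive(this);
    }
};
```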
The GC is not aggressive enough.
If you are working within address constraints (e.g. you need a 32 bit version) then you need to remember that the GC holds on to memory unless it thinks it needs to let go. Its only input to these thoughts is the managed world. If the unmanaged allocator needs space there is no connection to the GC. An unmanaged allocation can fail simply because the GC hasn't collected objects that are long out of scope. There is a memory pressure API but again it is only really usable/useful for quite simple designs.
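If you do go down that route, the memory pressure API is used roughly like this from C++/CLI (a sketch only; the wrapper class and sizes are illustrative, and this only nudges the GC rather than reserving anything):

```cpp
public ref class BigNativeBuffer {
    unsigned char* data_;
    long long      bytes_;
public:
    BigNativeBuffer(long long bytes) : bytes_(bytes) {
        data_ = new unsigned char[static_cast<size_t>(bytes)];
        System::GC::AddMemoryPressure(bytes_);   // GC now "knows" about the native block
    }
    ~BigNativeBuffer() { this->!BigNativeBuffer(); }
    !BigNativeBuffer() {
        if (data_ != nullptr) {
            delete[] data_;
            data_ = nullptr;
            System::GC::RemoveMemoryPressure(bytes_);
        }
    }
};
```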
Buffer copying. You also need to think about where to allocate any large memory blocks. Managed blocks can be pinned to look like unmanaged blocks. Unmanaged blocks can only ever be copied if they need to look like managed blocks. However when will that large managed block actually get released?
A piece of C++ I've been asked to look at is performing poorly due to an inordinate number of invocations of "new" on objects we use to store information about a node in an XML DOM tree. I've verified that new is the cause, using both AQTime and Very Sleepy profilers.
These objects all contain several other object types and pointers to objects as members, so each new on a node object will invoke the constructors of all the member objects as well, which I guess is the reason each allocation takes so long. It also means we can't just call something like GlobalAlloc and request a big chunk of memory - it needs to be initialised afterwards.
I've been investigating preallocation techniques to mitigate this poor performance, but the ones I've seen involve requesting big chunks of uninitialised memory, which isn't suitable for what I need, while others ultimately end up calling new anyway, cancelling out any performance gain we might observe. So I'm wondering if there is another option I'm unaware of? I have a feeling what I'm asking for can't be done, that it's either retrieving uninitialised memory quickly or initialised memory slowly. Please prove me wrong :)
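To make it concrete, the kind of thing I'm imagining is a small arena combined with placement new, so the constructors still run but the allocation itself is cheap (Node here is a simplified stand-in for our real node class, so this is just a sketch):

```cpp
#include <cstddef>
#include <new>
#include <string>
#include <vector>

// Simplified stand-in for our DOM node class; the real one has more members.
struct Node {
    std::string name;
    std::vector<Node*> children;
    explicit Node(std::string n) : name(std::move(n)) {}
};

// Grabs memory in big chunks, then hands out slots with placement new, so
// each "allocation" is a pointer bump but the constructor still runs.
class NodeArena {
    std::vector<char*> blocks_;
    std::size_t used_ = 0;
    static constexpr std::size_t kBlockBytes = 1 << 20;   // 1 MiB per chunk
public:
    // Note: this never runs ~Node(); real code would need to destroy nodes
    // explicitly (or restrict the arena to types whose destructors can be skipped).
    ~NodeArena() { for (char* b : blocks_) ::operator delete(b); }

    Node* create(std::string name) {
        if (blocks_.empty() || used_ + sizeof(Node) > kBlockBytes) {
            blocks_.push_back(static_cast<char*>(::operator new(kBlockBytes)));
            used_ = 0;
        }
        void* slot = blocks_.back() + used_;
        used_ += sizeof(Node);
        return new (slot) Node(std::move(name));   // placement new: ctor runs here
    }
};
```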
Thanks
I'm writing a 32-bit .NET program with a 2 stage input process:
It uses native C++ via C++/CLI to parse an indefinite number of files into corresponding SQLite databases (all with the same schema). The allocations by C++ 'new' will typically consume up to 1GB of the virtual address space (out of 2GB available; I'm aware of the 3GB extension but that'll just delay the issue).
It uses complex SQL queries (run from C#) to merge the databases into a single database. I set the cache_size to 1GB for the merged database so that the merging part has minimal page faults.
My problem is that the cache in stage 2 does not re-use the 1GB of memory allocated by 'new' and properly released by 'delete' in stage 1. I know there's no leak because immediately after leaving stage 1, 'private bytes' drops down to a low amount like I'd expect. 'Virtual size' however remains at about the peak of what the C++ used.
This non-sharing between the C++ and SQLite cache causes me to run out of virtual address space. How can I resolve this, preferably in a fairly standards-compliant way? I really would like to release the memory allocated by C++ back to the OS.
This is not something you can control effectively from the C++ level of abstraction (in other words, you cannot know for sure whether memory that your program released to the C++ runtime is going to be released to the OS or not). Using special allocation policies and non-standard extensions to try to handle the issue probably won't work anyway, because you cannot control how the external libraries you use deal with memory (e.g. whether they have cached data).
A possible solution would be moving the C++ part to an external process that terminates once the SQLite databases have been created. Having an external process will introduce some annoyance (e.g. it's a bit harder to keep live control over what happens), but it also opens up more possibilities, like parallel processing even if libraries don't support multithreading, or using multiple machines over a network.
Since you're interoperating with C++/CLI, you're presumably using Microsoft's compiler.
If that's the case, then you probably want to look up _heapmin. After you exit from your "stage 1", call it, and it'll release blocks of memory held by the C++ heap manager back to the OS, if the complete block that was allocated from the OS is now free.
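In other words, something like this after stage 1 finishes (MSVC-specific, so just a sketch):

```cpp
#include <malloc.h>   // MSVC CRT: _heapmin

void OnStageOneFinished() {
    // All stage-1 objects have been deleted by now; ask the CRT heap to
    // return fully-free blocks to the OS. Returns 0 on success, -1 on failure.
    _heapmin();
}
```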
On Linux, we used google malloc (http://code.google.com/p/google-perftools/). It has a function to release the free memory to the OS: MallocExtension::instance()->ReleaseFreeMemory().
In theory, google malloc works on Windows too, but I never personally used it there.
You could allocate the buffer on the GC heap from C#, pin it, use it from the native code, and then release it, thus freeing it and letting the GC compact it and re-use the memory.
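In C++/CLI terms, that idea looks roughly like this (ParseInto stands in for whatever the native parsing entry point actually is):

```cpp
#include <cstddef>

// Hypothetical native parser; in the real program this would be the existing C++ code.
void ParseInto(unsigned char* buffer, std::size_t size) {
    // ... native parsing work writes into buffer ...
}

void RunStageOne(int bytes) {
    // The big buffer lives on the GC heap, owned by managed code.
    array<unsigned char>^ managedBuf = gcnew array<unsigned char>(bytes);
    {
        pin_ptr<unsigned char> pinned = &managedBuf[0];
        ParseInto(pinned, static_cast<std::size_t>(bytes));   // stable pointer while pinned
    }   // pin released; the GC can now move, collect, and compact the array
}
```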
C++ programs can define and set a new_handler() that should be called from memory allocation functions like operator new() if it's impossible to allocate the memory requested.
One use of a custom new_handler() is dealing with C++ implementations that don't throw an exception on allocation failure. Another use is initiating garbage collection on systems that implement garbage collection.
What other uses of custom new_handler() are there?
In a similar vein to the garbage collection application, you can use the new handler to free up any cached data you may be keeping.
Say that you're caching some resource data read from disk or the intermediate results of some calculation. This is data that you can recreate at any time, so when the new handler gets called (signifying that you've run out of heap), you can free the memory for the cached data, then return from the new handler. Hopefully, new will now be able to make the allocation.
In many cases, virtual memory can serve the same purpose: You can simply let your cached data get swapped out to disk -- if you've got enough virtual address space. On 32-bit systems, this isn't really the case any more, so the new handler is an interesting option. Many embedded systems will face similar constraints.
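A minimal sketch of that pattern, with a deliberately simplified cache standing in for whatever recreatable data you hold:

```cpp
#include <new>
#include <unordered_map>
#include <vector>

// Stand-in for cached, recreatable data (e.g. resources decoded from disk).
std::unordered_map<int, std::vector<char>> g_resource_cache;

void flush_cache_handler() {
    if (!g_resource_cache.empty()) {
        g_resource_cache.clear();       // free the cache, then return so new retries
        return;
    }
    std::set_new_handler(nullptr);      // nothing left to free: let new throw bad_alloc
}

int main() {
    std::set_new_handler(flush_cache_handler);
    // ... allocation-heavy work; on exhaustion the cache gets dropped first ...
}
```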
On most of the servers I've worked on, the new_handler freed a pre-allocated block (so future new's wouldn't fail) before logging a message (the logger used dynamic memory) and aborting. This ensured that the out of memory error was correctly logged (instead of the process just "disappearing", with an error message to cerr, which was connected to /dev/null).
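A rough sketch of that reserve-block approach (the reserve size and the logging call are illustrative):

```cpp
#include <cstdlib>
#include <new>

static char* g_emergency_reserve = new char[64 * 1024];   // size is arbitrary

void oom_handler() {
    // Give the logger enough free heap to do its job.
    delete[] g_emergency_reserve;
    g_emergency_reserve = nullptr;
    // log_fatal("out of memory");   // hypothetical logging call
    std::abort();
}

int main() {
    std::set_new_handler(oom_handler);
    // ...
}
```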
In applications such as editors, etc., it may be possible to spill parts of some buffers to disk, then continue; if the new_handler returns, operator new is supposed to retry the allocation, and if the new_handler has freed up enough memory, the allocation may succeed (and if it doesn't, the new_handler will be called again, to maybe free up even more).
I've never used it for anything - too many OSes will grant virtual memory and then SIGSEGV or similar if they can't provide it later, so it's not a good idea to build a system that relies on being memory-exhaustion tolerant: it's often outside C++'s hands.

Still, if you developed for a system where it could/must be relied upon, I can easily imagine a situation where some real-time data was being streamed into a queue in your process, and you were processing it and writing/sending out results as fast as you could (e.g. video hardware streaming video for recompression to disk/network). If you got to the stage where you couldn't store any more, you'd simply have to drop some, but how would you know when it got that bad? Setting an arbitrary limit would be kind of silly, especially if your software was for an embedded environment / box that only existed to do this task. And you probably shouldn't use a function like this casually on a system with any kind of hard-disk based swap, as once you're already into swap your throughput rates would be miserable ever after. But - past the caveats - it might be useful to drop packets for a while until you catch up. Perhaps dropping every Nth frame through the queued buffer would be less visible than dropping a chunk at the back or front of the queue. Whatever, discarding data from the queue could be a sane application-level use (as distinct from an intra-memory-subsystem one) for something like this...
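For what it's worth, a loose sketch of the drop-every-Nth-frame idea (Frame and the queue are made up, and real code would need to coordinate with the producer thread):

```cpp
#include <cstddef>
#include <deque>
#include <new>
#include <vector>

struct Frame { std::vector<unsigned char> pixels; };   // stand-in for real frames
std::deque<Frame> g_frame_queue;                       // filled by the capture thread

// Installed with std::set_new_handler(drop_frames_handler) at startup.
void drop_frames_handler() {
    if (g_frame_queue.size() < 4) {
        std::set_new_handler(nullptr);   // nothing sensible left to drop
        return;
    }
    // Drop roughly every 4th frame in place (deque::erase shouldn't allocate);
    // spreading the loss through the queue should be less visible than
    // cutting a chunk off either end.
    for (std::size_t i = g_frame_queue.size(); i >= 4; i -= 4)
        g_frame_queue.erase(g_frame_queue.begin() + (i - 1));
}
```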