Dumping buffered data when memory limit is near - c++

My application buffers data for likely requests in the background. Currently I limit the size of the buffer based on a command-line parameter, and begin dumping less-used data when we hit this limit. This is not ideal because it relies on the user to specify a performance-critical parameter. Is there a better way to handle this? Is there a way to automatically monitor system memory use and dump the oldest/least-recently-used data before the system starts to thrash?
A complicating factor here is that my application runs on Linux, OSX, and Windows. But I'll take a good way to do this on only one platform over nothing.

Your best bet would likely be to monitor your applications working set/resident set size, and try to react when it doesn't grow after your allocations. Some pointers on what to look for:
Windows: GetProcessMemoryInfo
Linux: /proc/self/statm
OS X: task_info()
Windows also has GlobalMemoryStatusEx which gives you a nice Available Physical Memory figure.

I like your current solution. Letting the user decide is good. It's not obvious everyone would want the buffer to be as big as possible, is it? If you do invest in implemting some sort of memory monitor for automatically adjusting the buffer/cache size, at least let the user choose between the user set limit and the automatic/dynamic one.

I know this isn't a direct answer, but I'd say step back a bit and maybe don't do this.
Even if you have the API to see current physical memory usage, that's not enough to choose an ideal cache size. That would depend on your typical and future workloads for both the program and the machine (and the overall system of all clients running this program + the server(s) they're querying), the platform's caching behavior, whether the system should be tuned for throughput or latency, and so on. In a tight memory situation, you're going to be competing for memory with other opportunistic caches, including the OS's disk cache. On the one hand, you want to be exerting some pressure on them, to force out other low-value data. On the other hand, if you get greedy while there's plenty of memory, you're going to be affecting the behavior of other adaptive caches.
And with speculative caching/prefetching, the LRU value function is odd: you will (hopefully) fetch the most-likely-to-be-called data first, and less-likely data later, so the LRU data in your prefetch cache may be less valuable than older data. This could lead to perverse behavior in the systemwide set of caches by artificially "heating up" less commonly used data.
It seems unlikely that your program would be able to make a cache size choice better than a simple fixed size, perhaps scaled based on the size of overall physical memory on the machine. And there's very little chance it could beat a sysadmin who knows the machine's typical workload and its performance goals.
Using an adaptive cache sizing strategy means that your program's resource usage is going to be both variable and unpredictable. (With respect to both memory and the I/O and server requests used to populate that prefetch cache.) For a lot of server situations, that's not good. (Especially in HPC or DB servers, which this sounds like it might be for, or a high-utilization/high-throughput environment.) Consistency, configurability, and availability are often more important than maximum resource utilization. And locality of reference tends to fall off quickly, so you're likely getting very diminishing returns with larger cache sizes. If this is going to be used server-side, at least leave the option for explicit control of cache sizes, and probably make that the default, if not only, option.

There is a way: it is called virtual memory (vm). All three operating systems listed will use virtual memory (vm), unless there is no hardware support (which may be true in embedded systems). So I will assume that vm support is present.
Here is a quote from the architecture notes of the Varnish project:
The really short answer is that computers do not have two kinds of storage any more.
I would suggest you read the full text here: http://www.varnish-cache.org/trac/wiki/ArchitectNotes
It is a good read, and I believe will answer your question.

You could attempt to allocate some large-ish block of memory then check for a memory allocation exception. If the exception occurs, dump data. The problem is, this will only work when all system memory (or process limit) is reached. This means your application is likely to start swapping.
try {
char *buf = new char[10 * 1024 * 1024]; // 10 megabytes
free(buf);
} catch (const std::bad_alloc &) {
// Memory allocation failed - clean up old buffers
}
The problems with this approach are:
Running out of system memory can be dangerous and cause random applications to be shut down
Better memory management might be a better solution. If there is data that can be freed, why has it not already been freed? Is there a periodic process you could run to clean up unneeded data?

Related

Limit buffer cache used for mmap

I have a data structure that I'd like to rework to page out on-demand. mmap seems like an easy way to run some initial experiments. However, I want to limit the amount of buffer cache that the mmap uses. The machine has enough memory to page the entire data structure into cache, but for test reasons (and some production reasons too) I don't want to allow it to do that.
Is there a way to limit the amount of buffer cache used by mmap?
Alternatively, an mmap alternative that can achieve something similar and still limit memory usage would work too.
From my understanding, it is not possible. Memory mapping is controlled by the operating system. The kernel will make the decisions how to use the available memory in the best way, but it looks at the system in total. I'm not aware that quotas for caches on a process level are supported (at least, I have not seen such APIs in Linux or BSD).
There is madvise to give the kernel hints, but it does not support to limit the cache used for one process. You can give it hints like MADV_DONTNEED, which will reduce the pressure on the cache of other applications, but I would expect that it will do more harm than good, as it will most likely make caching less efficient, which will lead to more IO load on the system in total.
I see only two alternatives. One is trying to solve the problem at the OS level, and the other is to solve it at the application level.
At the OS level, I see two options:
You could run a virtual machine, but most likely this is not what you want. I would also expect that it will not improve the overall system performance. Still, it would be at least a way to define upper limits on the memory consumption.
Docker is the another idea that comes to mind, also operating at the OS level, but to the best of my knowledge, it does not support defining cache quotas. I don't think it will work.
That leaves only one option, which is to look at the application level. Instead of using memory mapped files, you could use explicit file system operations. If you need to have full control over the buffer, I think it is the only practical option. It is more work than memory mapping, and it is also not guaranteed to perform better.
If you want to stay with memory mapping, you could also map only parts of the file in memory and unmap other parts when you exceed your memory quota. It also has the same problem as the explicit file IO operations (more implementation work and non-trivial tuning to find a good caching strategy).
Having said that, you could question the requirement to limit the cache memory usage. I would expect that the kernel does a pretty good job at allocating memory resources in a good way. At least, it will likely be better than the solutions that I have sketched. (Explicit file IO, plus an internal cache, might be fast, but it is not trivial to implement and tune. Here, is a comparison of the trade-offs: mmap() vs. reading blocks.)
During testing, you could run the application with ionice -c 3 and nice -n 20 to somewhat reduce the impact on the other productive applications.
There is also a tool called nocache. I never used it but when reading through its documentation, it seems somewhat related to your question.
It might be possible to accomplish this through the use of mmap() and Linux Control Groups (more generally, here or here). Once installed, you have the ability to create arbitrary limits on the amount of, among other things, physical memory used by a process. As an example, here we limit the physical memory to 128 megs and swap memory to 256 megs:
cgcreate -g memory:/limitMemory
echo $(( 128 * 1024 * 1024 )) > /sys/fs/cgroup/memory/limitMemory/memory.limit_in_bytes
echo $(( 256 * 1024 * 1024 )) > /sys/fs/cgroup/memory/limitMemory/memory.memsw.limit_in_bytes
I would go the route of only map parts of the file at a time so you can retain full control on exactly how much memory is used.
you may use ipc shared memory segment, you will be the master of your memory segments.

Allocating memory that can be freed by the OS if needed

I'm writing a program that generates thumbnails for every page in a large document. For performance reasons I would like to keep the thumbnails in memory for as long as possible, but I would like the OS to be able to reclaim that memory if it decides there is another more important use for it (e.g. the user has started running a different application.)
I can always regenerate the thumbnail later if the memory has gone away.
Is there any cross-platform method for flagging memory as can-be-removed-if-needed? The program is written in C++.
EDIT: Just to clarify, rather than being notified when memory is low or regularly monitoring the system's amount of memory, I'm thinking more along the lines of allocating memory and then "unlocking" it when it's not in use. The OS can then steal unlocked memory if needed (even for disk buffers if it thinks that would be a better use of the memory) and all I have to do as a programmer is just "lock" the memory again before I intend to use it. If the lock fails I know the memory has been reused for something else so I need to regenerate the thumbnail again, and if the lock succeeds I can just keep using the data from before.
The reason is I might be displaying maybe 20 pages of a document on the screen, but I may as well keep thumbnails of the other 200 or so pages in case the user scrolls around a bit. But if they go do something else for a while, that memory might be better used as a disk cache or for storing web pages or something, so I'd like to be able to tell the OS that it can reuse some of my memory if it wants to.
Having to monitor the amount of free system-wide memory may not achieve the goal (my memory will never be reclaimed to improve disk caching), and getting low-memory notifications will only help in emergencies. I was hoping that by having a lock/unlock method, this could be achieved in more of a lightweight way and benefit the performance of the system in a non-emergency situation.
Is there any cross-platform method for flagging memory as can-be-removed-if-needed? The program is written in C++
For Windows, at least, you can register for a memory resource notification.
HANDLE WINAPI CreateMemoryResourceNotification(
_In_ MEMORY_RESOURCE_NOTIFICATION_TYPE NotificationType
);
NotificationType
LowMemoryResourceNotification Available physical memory is running low.
HighMemoryResourceNotification Available physical memory is high.
Just be careful responding to both events. You might create a feedback loop (memory is low, release the thumbnails! and then memory is high, make all the thumbnails!).
In AIX, there is a signal SIGDANGER that is send to applications when available memory is low. You may handle this signal and free some memory.
There is a discussion among Linux people to implement this feature into Linux. But AFAIK it is not yet implemented in Linux. Maybe they think that application should not care about low level memory management, and it could be transparently handled in OS via swapping.
In posix standard there is a function posix_madvise might be used to mark an area of memory as less important. There is an advice POSIX_MADV_DONTNEED specifies that the application expects that it will not access the specified range in the near future.
But unfortunately, current Linux implementation will immediately free the memory range when posix_madvise is called with this advice.
So there's no portable solution to your question.
However, on almost every OS you are able to read the current available memory via some OS interface. So you can routinely read such value and manually free memory when available memory in OS is low.
There's nothing special you need to do. The OS will remove things from memory if they haven't been used recently automatically. Some OSes have platform-specific ways to improve this, but generally, nothing special is needed.
This question is very similar and has answers that cover things not covered here.
Allocating "temporary" memory (in Linux)
This shouldn't be too hard to do because this is exactly what the page cache does, using unused memory to cache the hard disk. In theory, someone could write a filesystem such that when you read from a certain file, it calculated something, and the page cache would cache it automatically.
All the basics of automatically freed cache space are already there in any OS with a disk cache, and It's hard to imagine there not being an API for something that would make a huge difference especially in things like mobile web browsers.

How to allocate a memory block and place it into Cache?

I want to dynamically allocate a memory block for an array in C/C++, and this array will be accessed at a high frequency. So I want this array to stay on chip, i.e., in the Cache. How can I do this explicitly with code in C/C++?
There is no standard C++ language feature that allows you to do this.
Depending on your compiler and CPU, you may be able to use an arch-specific CPU instruction in an asm block:
T* p = new T(...);
size_t n = sizeof(T);
asm {
"CACHE n bytes at address p"
}
...or some builtin compiler function ("intrinsic") that does this.
You will need to consult your CPU manual and/or your compiler manual.
As an example, x86 CPUs have a set of instructions starting with PREFETCH.
And another example, GCC has a function called __builtin_prefetch. See GCC Data Prefetch Support
I will try to answer this question from a bit different perspective. Do you really need to do this. And even if it would be a way to do so, will it worth it? Imagine there is a "magic" void * malloc_and_lock_in_cache( int cacheLevel ) function. What you going to do with this data. If it's an application limited to while (1) loop with random array access from single thread you will have such behaviour anyway due to optimisation and CPU architecture. If you think about more real world solutions you always have logic around access. For example locking for multithreading, certain conditions, etc. The the question - do the rest of your application algorithms are so perfect that only left to do is to allocate array in cache.
Do all other access/sorting/lookup functions are state-of-art logic which cannot be reviewed rather then gaining very limited performance kickback trying to overwrite CPU optimisation.
Also do you consider to run your application without ANY operation system on a raw hardware so you shouldn't care about how your allocation will affects OS behaviour, rest of application running around?
And what should happen if your application will run inside virtual machine or environments like XEN.?
I can remember one similar popular subject 15-18 years ago about physical memory usage and disk caching utilities. Indeed tools like MS-DOS smartdrive or similar utilities were REALLY useful and speed up things a lot. Usenet was full of 'tuning advices' and performance analyses for things like write-through/write-back settings.
Especially if your DOS application were processing large amounts of data and implemented some memory swapping logic (I am talking about times then 4MB RAM was luxury) that's became mostly a drama, that from one point of view you need as much memory you can, but from another point of view you need swapping, so you actually need to swap, but swapping goes through cache etc..
But what happened next. We've got VM386 mode, disk cache/memory swaps integrated into OS, and who was care anymore about things like tuning smartdrive/ramdisks. In general it was 'cheaper' to allocate as much as you need VM then implement own voodoo algorithms to swap physical memory blocks (although this functionality is still in WinAPI).
So I would really recommend to concentrate efforts on algorithms and application design rather then trying to use some very low level features with really unpredictable results until you dont develop some new microkernel OS.
I don't think you can. First, which cache? L3, L2, L1? You can prefetch, and align so it its access is more optimized, and then you can query it periodically maybe to make it stay and not go LRU'd, but you can't really make it stay in cache.
First you have to know what's the architecture of the machine you want to run the code on. Then you should check it there's an instruction doing that kind of stuff.
Actually using the memory heavily will force the cache controller to put this region in cache.
And there are three rules of optimizing, you may want to know them first :)
http://c2.com/cgi/wiki?RulesOfOptimization

Dealing with large amounts of data in c++

I have an application that sometimes will utilize a large amount of data. The user has the option to load in a number of files which are used in a graphical display. If the user selects more data than the OS can handle, the application crashes pretty hard. On my test system, that number is about the 2 gigs of physical RAM.
What is a good way to handle this situation? I get the "bad alloc" thrown from new and tried trapping that but I still run into a crash. I feel as if I'm treading in nasty waters loading this much data but it is a requirement of this application to handle this sort of large data load.
Edit: I'm testing under a 32 bit Windows system for now but the application will run on various flavors of Windows, Sun and Linux, mostly 64 bit but some 32.
The error handling is not strong: It simply wraps the main instantiation code with a try catch block, the catch looking for any exception per another peer's complaint of not being able to trap the bad_alloc everytime.
I think you guys are right, I need a memory management system that doesn't load all of this data into the RAM, it just seems like it.
Edit2: Luther said it best. Thanks guy. For now, I just need a way to prevent a crash which with proper exception handling should be possible. But down the road I'll be implementing that acception solution.
There is the STXXL library which offers STL like containers for large Datasets.
http://stxxl.sourceforge.net/
Change "large" into "huge". It is designed and optimized for multicore processing of data sets that fit on terabyte-disks only. This might suffice for your problem, or the implementation could be a good starting point to tailor your own solution.
It is hard to say anything about your application crashing, because there are numerous hiccups involved when it comes to tight memory conditions: You could hit a hard address space limit (for example by default 32-bit Windows only has 2GB address space per user process, this can be changed, http://www.fmepedia.com/index.php/Category:Windows_3GB_Switch_FAQ ), or be eaten alive by the OOM killer ( Not a mythical beast:, see http://lwn.net/Articles/104179/ ).
What I'd suggest in any case to think about a way to keep the data on disk and treat the main memory as a kind of Level-4 cache for the data. For example if you have, say, blobs of data, then wrap these in a class which can transparently load the blobs from disk when they are needed and registers to some kind of memory manager which can ask some of the blob-holders to free up their memory before the memory conditions become unbearable. A buffer cache thus.
The user has the option to load in a number of files which are used in a graphical display.
Usual trick is not to load the data into memory directly, but rather use the memory mapping mechanism to make the files look like memory.
You need to make sure that the memory mapping is done in read-only mode to allow the OS to evict it from RAM if it is needed for something else.
If the user selects more data than the OS can handle, the application crashes pretty hard.
Depending on OS it is either: application is missing some memory allocation error handling or you really getting to the limit of available virtual memory.
Some OSs also have an administrative limit on how large the heap of application can grow.
On my test system, that number is about the 2 gigs of physical RAM.
It sounds like:
your application is 32-bits and
your OS uses the 2GB/2GB virtual memory split.
To avoid hitting the limit, your need to:
upgrade your app and OS to 64-bit or
tell OS (IIRC patch for Windows; most Linuxes already have it) to use 3GB/1GB virtual memory split. Some 32-bit OSs are using 2GB/2GB memory split: 2GB of virtual memory for kernel and 2 for the user application. 3/1 split means 1GB of VM for kernel, 3 for the user application.
How about maintaining a header table instead of loading the entire data. Load the actual page when the user requests the data.
Also use some data compression algorithms (like 7zip, znet etc.) which reduce the file size. (In my project they reduced the size from 200MB to 2MB)
I mention this because it was only briefly mentioned above, but it seems a "file paging system" could be a solution. These systems read large data sets in "chunks" by breaking the files into pieces. Once written, they generally "just work" and you hopefully won't have to tinker with them anymore.
Reading Large Files
Variable Length Data in File--Paging
New Link below with very good answer.
Handling Files greater than 2 GB
Search term: "file paging lang:C++" add large or above 2GB for more. HTH
Not sure if you are hitting it or not, but if you are using Linux, malloc will typically not fail, and operator new will typically not throw bad_alloc. This is because Linux will overcommit, and instead kill your process when it decides the system doesn't have enough memory, possibly at a page fault.
See: Google search for "oom killer".
You can disable this behavior with:
echo 2 > /proc/sys/vm/overcommit_memory
Upgrade to a 64-bit CPU, 64-bit OS and 64-bit compiler, and make sure you have plenty of RAM.
A 32-bit app is restricted to 2GB of memory (regardless of how much physical RAM you have). This is because a 32-bit pointer can address 2^32 bytes == 4GB of virtual memory. 20 years ago this seemed like a huge amount of memory, so the original OS designers allocated 2GB to the running application and reserved 2GB for use by the OS. There are various tricks you can do to access more than 2GB, but they're complex. It's probably easier to upgrade to 64-bit.

Staying away from virtual memory in Windows\C++

I'm writing a performance critical application where its essential to store as much data as possible in the physical memory before dumping to disc.
I can use ::GlobalMemoryStatusEx(...) and ::GetProcessMemoryInfo(...) to find out what percentage of physical memory is reserved\free and how much memory my current process handles.
Using this data I can make sure to dump when ~90% of the physical memory is in use or ~90 of the maximum of 2GB per application limit is hit.
However, I would like a method for simply recieving how many bytes are actually left before the system will start using the virtual memory, especially as the application will be compiled for both 32bit and 64bit, whereas the 2 GB limit doesnt exist.
How about this function:
int
bytesLeftUntilVMUsed() {
return 0;
}
it should give the correct result in nearly all cases I think ;)
Imagine running Windows 7 in 256Mb of RAM (MS suggest 1GB minimum). That's effectively what you're asking the user to do by wanting to reseve 90% of available RAM.
The real question is: Why do you need so much RAM? What is the 'performance critical' criteria exactly?
Usually, this kind of question implies there's something horribly wrong with your design.
Update:
Using top of the range RAM (DDR3) would give you a theoretical transfer speed of 12GB/s which equates to reading one 32 bit value every clock cycle with some bandwidth to spare. I'm fairly sure that it is not possible to do anything useful with the data coming into the CPU at that speed - instruction processing stalls would interrupt this flow. The extra, unsued bandwidth can be used to page data to/from a hard disk. Using RAID this transfer rate can be quite high (about 1/16th of the RAM bandwidth). So it would be feasible to transfer data to/from the disk and process it without having any degradation of performance - 16 cycles between reads is all it would take (OK, my maths might be a bit wrong here).
But if you throw Windows into the mix, it all goes to pot. Your memory can go away at any moment, your application can be paused arbitrarily and so on. Locking memory to RAM would have adverse affects on the whole system, thus defeating the purpose of locing the memory.
If you explain what you're trying to acheive and the performance critria, there are many people here that will help develop a suitable solution, because if you have to ask about system limits, you really are doing something wrong.
Even if you're able to stop your application from having memory paged out to disk, you'll still run into the problem that the VMM might be paging out other programs to disk and that might potentially affect your performance as well. Not to mention that another application might start up and consume memory that you're currently occupying and thus resulting in some of your applications memory being paged out. How are you planning to deal with that?
There is a way to use non-pageable memory via the non-paged pool but (a) this pool is comparatively small and (b) it's used by device drivers and might only be usable from inside the kernel. It's also not really recommended to use large chunks of it unless you want to make sure your system isn't that stable.
You might want to revisit the design of your application and try to work around the possibility of having memory paged to disk before you either try to write your own VMM or turn a Windows machine into essentially a DOS box with more memory.
The standard solution is to not worry about "virtual" and worry about "dynamic".
The "virtual" part of virtual memory has to be looked at as a hardware function that you can only defeat by writing your own OS.
The dynamic allocation of objects, however, is simply your application program's design.
Statically allocate simple arrays of the objects you'll need. Use those arrays of objects. Increase and decrease the size of those statically allocated arrays until you have performance problems.
Ouch. Non-paged pool (the amount of RAM which cannot be swapped or allocated to processes) is typically 256 MB. That's 12.5% of RAM on a 2GB machine. If another 90% of physical RAM would be allocated to a process, that leaves either -2,5% for all other applications, services, the kernel and drivers. Even if you'd allocate only 85% for your app, that would still leave only 2,5% = 51 MB.