Staying away from virtual memory in Windows/C++ - c++

I'm writing a performance-critical application where it's essential to store as much data as possible in physical memory before dumping to disk.
I can use ::GlobalMemoryStatusEx(...) and ::GetProcessMemoryInfo(...) to find out what percentage of physical memory is reserved/free and how much memory my current process uses.
Using this data I can make sure to dump when ~90% of the physical memory is in use, or when ~90% of the 2 GB per-application limit is hit.
However, I would like a way to simply retrieve how many bytes are actually left before the system starts using virtual memory, especially as the application will be compiled for both 32-bit and 64-bit, and the 2 GB limit doesn't exist on the latter.
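For concreteness, here is a rough sketch (not the asker's actual code) of how those two calls are usually combined; error handling is omitted:
// Sketch only: combines the two APIs named in the question.
#include <windows.h>
#include <psapi.h>   // link with psapi.lib
#include <cstdio>

int main() {
    MEMORYSTATUSEX ms = {};
    ms.dwLength = sizeof(ms);
    GlobalMemoryStatusEx(&ms);                                     // system-wide figures

    PROCESS_MEMORY_COUNTERS pmc = {};
    GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc));  // this process only

    std::printf("Physical memory load: %lu%%\n", ms.dwMemoryLoad);
    std::printf("Available physical:   %llu bytes\n", ms.ullAvailPhys);
    std::printf("Process working set:  %zu bytes\n", (size_t)pmc.WorkingSetSize);

    // The question's heuristic: dump to disk once ~90% of physical RAM is in use.
    bool shouldDump = ms.dwMemoryLoad >= 90;
    return shouldDump ? 1 : 0;
}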

How about this function:
int bytesLeftUntilVMUsed() {
    return 0;
}
It should give the correct result in nearly all cases, I think ;)

Imagine running Windows 7 in 256 MB of RAM (MS suggests 1 GB minimum). That's effectively what you're asking the user to do by wanting to reserve 90% of available RAM.
The real question is: Why do you need so much RAM? What is the 'performance critical' criteria exactly?
Usually, this kind of question implies there's something horribly wrong with your design.
Update:
Using top-of-the-range RAM (DDR3) would give you a theoretical transfer speed of 12 GB/s, which equates to reading one 32-bit value every clock cycle with some bandwidth to spare. I'm fairly sure it is not possible to do anything useful with data coming into the CPU at that speed - instruction processing stalls would interrupt the flow. The extra, unused bandwidth can be used to page data to/from a hard disk. Using RAID, this transfer rate can be quite high (about 1/16th of the RAM bandwidth). So it would be feasible to transfer data to/from the disk and process it without any degradation of performance - 16 cycles between reads is all it would take (OK, my maths might be a bit wrong here).
But if you throw Windows into the mix, it all goes to pot. Your memory can go away at any moment, your application can be paused arbitrarily, and so on. Locking memory to RAM would have adverse effects on the whole system, thus defeating the purpose of locking the memory.
If you explain what you're trying to achieve and the performance criteria, there are many people here who will help develop a suitable solution, because if you have to ask about system limits, you really are doing something wrong.

Even if you're able to stop your application from having memory paged out to disk, you'll still run into the problem that the VMM might be paging other programs out to disk, and that might affect your performance as well. Not to mention that another application might start up and consume memory you're currently occupying, resulting in some of your application's memory being paged out. How are you planning to deal with that?
There is a way to use non-pageable memory via the non-paged pool, but (a) this pool is comparatively small and (b) it's used by device drivers and might only be usable from inside the kernel. It's also not recommended to use large chunks of it unless you want to destabilize your system.
You might want to revisit the design of your application and try to work around the possibility of having memory paged to disk before you either try to write your own VMM or turn a Windows machine into essentially a DOS box with more memory.

The standard solution is to not worry about "virtual" and worry about "dynamic".
The "virtual" part of virtual memory has to be looked at as a hardware function that you can only defeat by writing your own OS.
The dynamic allocation of objects, however, is simply your application program's design.
Statically allocate simple arrays of the objects you'll need. Use those arrays of objects. Increase and decrease the size of those statically allocated arrays until you have performance problems.
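A minimal sketch of that approach (the struct and MAX_RECORDS are invented placeholders; size them from your own measurements):
// Sketch of the "statically allocate, then tune" approach described above.
#include <cstddef>

struct Record {
    double values[16];
};

const std::size_t MAX_RECORDS = 1 << 20;   // tune up/down until you hit performance problems
static Record g_records[MAX_RECORDS];      // lives in the process image, no heap churn
static std::size_t g_used = 0;

Record* acquireRecord() {
    // Caller dumps to disk and recycles entries when this returns NULL.
    return (g_used < MAX_RECORDS) ? &g_records[g_used++] : NULL;
}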

Ouch. The non-paged pool (the portion of RAM that cannot be swapped or allocated to processes) is typically 256 MB. That's 12.5% of RAM on a 2 GB machine. If another 90% of physical RAM were allocated to one process, that would leave -2.5% for all other applications, services, the kernel and drivers. Even if you allocated only 85% for your app, that would still leave only 2.5% ≈ 51 MB.

Related

Dumping buffered data when memory limit is near

My application buffers data for likely requests in the background. Currently I limit the size of the buffer based on a command-line parameter, and begin dumping less-used data when we hit this limit. This is not ideal because it relies on the user to specify a performance-critical parameter. Is there a better way to handle this? Is there a way to automatically monitor system memory use and dump the oldest/least-recently-used data before the system starts to thrash?
A complicating factor here is that my application runs on Linux, OSX, and Windows. But I'll take a good way to do this on only one platform over nothing.
Your best bet would likely be to monitor your application's working set/resident set size, and try to react when it doesn't grow after your allocations. Some pointers on what to look for:
Windows: GetProcessMemoryInfo
Linux: /proc/self/statm
OS X: task_info()
Windows also has GlobalMemoryStatusEx which gives you a nice Available Physical Memory figure.
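As an illustration, on Linux the resident set size can be read from /proc/self/statm roughly like this (a sketch; the fields are in pages per proc(5), and error handling is minimal):
// Sketch: read this process's resident set size on Linux from /proc/self/statm.
// The first two fields are total size and resident size, both in pages.
#include <cstdio>
#include <unistd.h>

long residentBytes() {
    long sizePages = 0, residentPages = 0;
    FILE* f = std::fopen("/proc/self/statm", "r");
    if (!f) return -1;
    if (std::fscanf(f, "%ld %ld", &sizePages, &residentPages) != 2) residentPages = -1;
    std::fclose(f);
    return residentPages < 0 ? -1 : residentPages * sysconf(_SC_PAGESIZE);
}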
I like your current solution. Letting the user decide is good. It's not obvious everyone would want the buffer to be as big as possible, is it? If you do invest in implementing some sort of memory monitor for automatically adjusting the buffer/cache size, at least let the user choose between the user-set limit and the automatic/dynamic one.
I know this isn't a direct answer, but I'd say step back a bit and maybe don't do this.
Even if you have the API to see current physical memory usage, that's not enough to choose an ideal cache size. That would depend on your typical and future workloads for both the program and the machine (and the overall system of all clients running this program + the server(s) they're querying), the platform's caching behavior, whether the system should be tuned for throughput or latency, and so on. In a tight memory situation, you're going to be competing for memory with other opportunistic caches, including the OS's disk cache. On the one hand, you want to be exerting some pressure on them, to force out other low-value data. On the other hand, if you get greedy while there's plenty of memory, you're going to be affecting the behavior of other adaptive caches.
And with speculative caching/prefetching, the LRU value function is odd: you will (hopefully) fetch the most-likely-to-be-called data first, and less-likely data later, so the LRU data in your prefetch cache may be less valuable than older data. This could lead to perverse behavior in the systemwide set of caches by artificially "heating up" less commonly used data.
It seems unlikely that your program would be able to make a cache size choice better than a simple fixed size, perhaps scaled based on the size of overall physical memory on the machine. And there's very little chance it could beat a sysadmin who knows the machine's typical workload and its performance goals.
Using an adaptive cache sizing strategy means that your program's resource usage is going to be both variable and unpredictable. (With respect to both memory and the I/O and server requests used to populate that prefetch cache.) For a lot of server situations, that's not good. (Especially in HPC or DB servers, which this sounds like it might be for, or a high-utilization/high-throughput environment.) Consistency, configurability, and availability are often more important than maximum resource utilization. And locality of reference tends to fall off quickly, so you're likely getting very diminishing returns with larger cache sizes. If this is going to be used server-side, at least leave the option for explicit control of cache sizes, and probably make that the default, if not only, option.
There is already a mechanism for this: it is called virtual memory (VM). All three operating systems listed use virtual memory unless there is no hardware support for it (which may be true in embedded systems), so I will assume VM support is present.
Here is a quote from the architecture notes of the Varnish project:
The really short answer is that computers do not have two kinds of storage any more.
I would suggest you read the full text here: http://www.varnish-cache.org/trac/wiki/ArchitectNotes
It is a good read, and I believe will answer your question.
You could attempt to allocate a largish block of memory and then check for a memory allocation exception. If the exception occurs, dump data. The problem is that this will only work when all system memory (or the process limit) is exhausted, by which point your application is likely to be swapping already.
try {
    char *buf = new char[10 * 1024 * 1024]; // 10 megabytes
    delete[] buf;                           // memory from new[] must be released with delete[], not free()
} catch (const std::bad_alloc &) {
    // Memory allocation failed - clean up old buffers
}
The problems with this approach are:
Running out of system memory can be dangerous and cause random applications to be shut down
Better memory management might be a better solution. If there is data that can be freed, why has it not already been freed? Is there a periodic process you could run to clean up unneeded data?

Memory mapped files performance - memory management when working with large data sets

I have a situation where I need to work with a number (15-30) of large (several hundred MB each) data structures. They won't all fit into memory at the same time. To make things worse, the algorithms operating on them work across all those structures, i.e. not first one, then the other, etc. I need to make this as fast as possible.
So I figured I'd allocate memory on disk, in files that are basically direct binary representations of the data as it is laid out in memory, and use memory-mapped files to access it. I use mmap 'views' of, for example, 50 MB (50 MB of each file is loaded into memory at a time), so when I have 15 data sets, my process uses 750 MB of memory for the data. That was OK initially (for testing); when I have more data I adjust the 50 MB down at the cost of some speed.
However, this heuristic is hard-coded for now (I know the size of the data set I will test with). 'In the wild', my software will need to be able to determine the 'right' amount of memory to allocate to maximize performance. I could say 'I will target a memory use of 500 MB' and then divide 500 by the number of data structures to arrive at an mmap view size. I have found that when I set this 'target memory usage' too high, virtual memory manager disk thrashing will (almost) lock up the machine and render it unusable until the processing finishes. This is to be avoided in my 'production' solution.
So my questions, all somewhat different approaches to the problem:
What is the 'best' target size for a single process? Should I just try to max out the 2 GB that I have (assuming 32-bit Windows XP and up, without /3GB for now), or try to keep my process size smaller so that my software won't hog the machine? When I have two instances of Visual Studio, Outlook and Firefox open on my machine, they easily use half a GB of virtual memory by themselves - if I let my software use 2 GB of virtual memory, the swapping will severely slow down the machine. But then how do I determine the 'best' process size?
What can I do to keep the performance of the machine in check when working with memory-mapped files? My application does fairly simple numerical operations on the data, which basically means it zips over hundreds of megabytes of data very quickly, causing the whole set of memory-mapped files (several gigabytes) to be loaded into memory and swapped out again very quickly, over and over (think Monte Carlo style simulation).
Is there any chance that not using memory-mapped files and just using fseek/fgets is going to be faster or less intrusive than using memory mapped files?
Any articles, papers or books I can read about this? Either with 'cookbook' style solutions or fundamental concepts.
Thanks.
It occurs to me that you could set some predefined threshold for "too darn slow" and use the computer's wall-clock to make your alterations on the fly.
Start conservatively low. If this is below your "too darn slow" threshold, bump the size up a little for the next file. Do this iteratively. When you go above the threshold, slowly back the size off, again iteratively.
I think this is a good place to try Address Windowing Extensions: http://msdn.microsoft.com/en-us/library/aa366527(v=VS.85).aspx
It will allow you to use more than 4GB of memory by providing a sliding window onto physical pages. The drawback is that not all versions of Windows support it.
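A rough outline of the AWE calls involved (this assumes the account holds the "Lock pages in memory" privilege; error handling is omitted, so treat it as a sketch rather than working code):
// Outline of Address Windowing Extensions usage.
#include <windows.h>
#include <vector>

int main() {
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    ULONG_PTR pages = (256ull * 1024 * 1024) / si.dwPageSize;   // 256 MB of physical pages
    std::vector<ULONG_PTR> pfns(pages);
    AllocateUserPhysicalPages(GetCurrentProcess(), &pages, &pfns[0]);

    // Reserve a virtual "window" and slide the physical pages into it.
    void* window = VirtualAlloc(NULL, pages * si.dwPageSize,
                                MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);
    MapUserPhysicalPages(window, pages, &pfns[0]);

    // ... use the window, then remap it over a different set of pages as needed ...

    MapUserPhysicalPages(window, pages, NULL);                  // unmap the window
    FreeUserPhysicalPages(GetCurrentProcess(), &pages, &pfns[0]);
    VirtualFree(window, 0, MEM_RELEASE);
    return 0;
}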
I probably wouldn't use a memory-mapped file for this app. Memory-mapped files work best when you have a large virtual address space (at least relative to the size of the data you're processing). You map the entire file, and let the OS decide which pieces remain resident.
However, if you're repeatedly mapping and unmapping segments of the file (rather than the entire file), you'll probably end up doing just as well by reading chunks via fseek and fread -- note, however, that you do not want to read individual pieces of data this way (ie, do one large read rather than a lot of small reads).
The one way that manually segmented memory-mapped files might win is if you have sparse reads: if you'll only be touching, say 10% of a given file. In this case, memory mapping means the OS will read only those pages that are touched, whereas explicit reads will load the entire file.
Oh, and I would definitely not spend time trying to control my resource consumption. The OS will do that better than you can, because it knows about all competing processes.
It will probably be best to fix the size of the memory-mapped file views at some percentage of total system memory, with a set minimum.
Remember that the operating system will effectively load a whole memory page when you access even a single byte. This may well happen in the background, but it will only be fast if sequential data accesses tend to be close together.
You should therefore try to keep sequential accesses to your data as close together in memory/the file as possible. You can also look at preloading strategies, accessing your data speculatively before you actually need it. These are the same considerations you face when optimizing for memory cache efficiency.
If sequential data accesses are scattered widely in your file, you may be better off using fseek and fread to access the data, since this gives you finer-grained control over what data is read into memory and when.
Also remember that there are no hard and fast rules. Optimizations can sometimes be counter-intuitive so try a whole bunch of different things and see which works best on the platform that this will need to operate on.
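For illustration, a sketch of the fseek/fread route mentioned above, reading one large chunk at a time (CHUNK_BYTES is just a placeholder to tune; add an fseek before the loop if your accesses are not sequential):
// Sketch: process a large file in big sequential chunks instead of mapping it.
#include <cstddef>
#include <cstdio>
#include <vector>

static void processChunk(const char* /*data*/, std::size_t /*n*/) { /* your analysis goes here */ }

void scanFile(const char* path) {
    const std::size_t CHUNK_BYTES = 50 * 1024 * 1024;    // e.g. the 50 MB view size from the question
    std::vector<char> buf(CHUNK_BYTES);
    FILE* f = std::fopen(path, "rb");
    if (!f) return;
    std::size_t got;
    while ((got = std::fread(&buf[0], 1, buf.size(), f)) > 0)
        processChunk(&buf[0], got);                       // one big read, not many small ones
    std::fclose(f);
}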
Perhaps you can use the /LARGEADDRESSAWARE flag for your linker in Visual Studio, and use bcdedit so your process can use more than 2 GB of memory.

How to write a C or C++ program to act as a memory and CPU cycle filler?

I want to test a program's memory management capabilities. For example (say the program's name is director):
What happens if some other processes take up too much memory and there is too little memory left for director to run? How does director behave?
What happens if too many of the CPU cycles are used by some other program while director is running?
What happens if memory used by the other programs is freed after some time? How does director reclaim the memory and start working at full capacity?
etc.
I'll be doing these experiments on a Unix machine. One way is to limit the amount of memory available to the process using ulimit, but there is no good way to control CPU cycle utilization.
I have another idea. What if I write a program in C or C++ that acts as a dynamic memory and CPU filler, i.e. does nothing useful but eats up memory and/or CPU cycles anyway?
I need some ideas on how such a program should be structured. I need dynamic (runtime) control over the memory and CPU it uses.
I think that creating a lot of threads would be a good way to clog up the CPU cycles. Is that right?
Is there a better approach that I can use?
Any ideas/suggestions/comments are welcome.
http://weather.ou.edu/~apw/projects/stress/
Stress is a deliberately simple workload generator for POSIX systems. It imposes a configurable amount of CPU, memory, I/O, and disk stress on the system. It is written in C, and is free software licensed under the GPLv2.
The functionality you seek overlaps the feature set of "test tools". So also check out http://ltp.sourceforge.net/tooltable.php.
If you have a single core this is enough to put stress on a CPU:
volatile unsigned long long x = 0;   // volatile so the compiler can't optimize the loop away
while (true) {
    x++;
}
If you have lots of cores then you need a thread per core.
You make it variably hungry by adding a few sleeps.
As for memory, just allocate lots.
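A bare-bones sketch along those lines (the thread count and allocation size are the run-time knobs; in a real tool you would expose them on the command line):
// Sketch of a simple CPU + memory hog. The constants are placeholders.
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

std::atomic<bool> g_run(true);

void burnCpu() {
    volatile unsigned long long x = 0;
    while (g_run) { ++x; }                          // busy loop pins one core
}

int main() {
    // One spinning thread per core eats the CPU.
    unsigned threads = std::thread::hardware_concurrency();
    std::vector<std::thread> burners;
    for (unsigned i = 0; i < threads; ++i) burners.emplace_back(burnCpu);

    // Touch the allocation so the pages are actually committed, not just reserved.
    // May throw bad_alloc on a cramped 32-bit build; shrink the size if so.
    std::vector<char> hog(512u * 1024 * 1024, 1);   // ~512 MB of dirty pages

    std::this_thread::sleep_for(std::chrono::seconds(60));
    g_run = false;
    for (std::size_t i = 0; i < burners.size(); ++i) burners[i].join();
    return 0;
}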
There are several problems with such a design:
In a virtual memory system, memory size is effectively unlimited. (Well, it's limited by your hard disk...) In practice, systems usually run out of address space well before they run out of backing store -- and address space is a per-process resource.
Any reasonable (non realtime) operating system is going to limit how much CPU time and memory your process can use relative to other processes.
It's already been done.
More importantly, I don't see why you would ever want to do this.
For dynamic memory control, you could just allocate or free buffers of a certain size to use more or less memory. As for CPU utilization, you will have to use an OS function to check the current load, poll it periodically, and decide whether you need to burn more cycles.

Dealing with large amounts of data in c++

I have an application that sometimes uses a large amount of data. The user has the option to load a number of files, which are used in a graphical display. If the user selects more data than the OS can handle, the application crashes pretty hard. On my test system, that number is about the 2 GB of physical RAM.
What is a good way to handle this situation? I get the bad_alloc thrown from new and tried trapping that, but I still run into a crash. I feel as if I'm treading in nasty waters loading this much data, but it is a requirement of this application to handle this sort of large data load.
Edit: I'm testing under a 32 bit Windows system for now but the application will run on various flavors of Windows, Sun and Linux, mostly 64 bit but some 32.
The error handling is not strong: it simply wraps the main instantiation code in a try/catch block, with the catch looking for any exception, per another peer's complaint of not being able to trap bad_alloc every time.
I think you guys are right, I need a memory management system that doesn't load all of this data into the RAM, it just seems like it.
Edit2: Luther said it best. Thanks, guys. For now, I just need a way to prevent a crash, which should be possible with proper exception handling. But down the road I'll be implementing the accepted solution.
There is the STXXL library, which offers STL-like containers for large datasets.
http://stxxl.sourceforge.net/
Change "large" into "huge". It is designed and optimized for multicore processing of data sets that fit on terabyte-disks only. This might suffice for your problem, or the implementation could be a good starting point to tailor your own solution.
It is hard to say anything about your application crashing, because there are numerous hiccups involved when it comes to tight memory conditions: you could hit a hard address space limit (for example, by default 32-bit Windows only has 2GB of address space per user process; this can be changed, http://www.fmepedia.com/index.php/Category:Windows_3GB_Switch_FAQ ), or be eaten alive by the OOM killer (not a mythical beast, see http://lwn.net/Articles/104179/ ).
What I'd suggest in any case is to think about a way to keep the data on disk and treat main memory as a kind of level-4 cache for it. For example, if you have, say, blobs of data, wrap them in a class which can transparently load the blobs from disk when they are needed and which registers with some kind of memory manager that can ask some of the blob holders to free up their memory before memory conditions become unbearable. A buffer cache, in other words.
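A very rough sketch of that blob-holder idea (all names are invented for illustration; a real buffer cache would also need an eviction policy, locking, and I/O error handling):
// Sketch of the "treat RAM as a cache over on-disk blobs" idea above.
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

class BlobHolder {
public:
    explicit BlobHolder(const std::string& path) : path_(path) {}

    const std::vector<char>& data() {          // transparent load on first use
        if (bytes_.empty()) {
            std::ifstream in(path_.c_str(), std::ios::binary);
            bytes_.assign(std::istreambuf_iterator<char>(in),
                          std::istreambuf_iterator<char>());
        }
        return bytes_;
    }

    void release() {                           // memory manager calls this under pressure
        std::vector<char>().swap(bytes_);
    }

private:
    std::string path_;
    std::vector<char> bytes_;
};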
The user has the option to load in a number of files which are used in a graphical display.
The usual trick is not to load the data into memory directly, but rather to use the memory-mapping mechanism to make the files look like memory.
You need to make sure the mapping is done in read-only mode, to allow the OS to evict it from RAM when the memory is needed for something else.
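On Windows, that read-only mapping looks roughly like this (a sketch only; error handling is omitted, and the whole file is mapped at once, so it assumes the file fits in the process's address space):
// Sketch: map an entire file read-only so the OS pages it in and out as needed.
#include <windows.h>

const char* mapWholeFile(const wchar_t* path, HANDLE* outFile, HANDLE* outMapping) {
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READONLY, 0, 0, NULL); // size 0,0 = whole file
    const void* view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);          // map the whole file
    *outFile = file;
    *outMapping = mapping;
    return static_cast<const char*>(view);
}

// Later: UnmapViewOfFile(view); CloseHandle(mapping); CloseHandle(file);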
If the user selects more data than the OS can handle, the application crashes pretty hard.
Depending on the OS, it is one of two things: either the application is missing some memory allocation error handling, or you really are hitting the limit of available virtual memory.
Some OSs also have an administrative limit on how large the heap of application can grow.
On my test system, that number is about the 2 gigs of physical RAM.
It sounds like:
your application is 32-bit, and
your OS uses the 2GB/2GB virtual memory split.
To avoid hitting the limit, you need to:
upgrade your app and OS to 64-bit, or
tell the OS (IIRC a patch for Windows; most Linuxes already have it) to use a 3GB/1GB virtual memory split. Some 32-bit OSs use a 2GB/2GB split: 2GB of virtual memory for the kernel and 2 for the user application. A 3/1 split means 1GB of VM for the kernel and 3 for the user application (the relevant switches are sketched just after this list).
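For reference, the usual knobs, to the best of my knowledge (verify against your OS version before relying on them):
link with /LARGEADDRESSAWARE (or enable the equivalent linker option in Visual Studio)
Windows XP era: add /3GB (optionally /USERVA=3072) to boot.ini
Windows Vista and later: bcdedit /set increaseuserva 3072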
How about maintaining a header table instead of loading the entire data set? Load the actual page only when the user requests that data.
Also use some data compression algorithms (like 7zip, znet etc.) which reduce the file size. (In my project they reduced the size from 200MB to 2MB)
I mention this because it was only briefly mentioned above, but it seems a "file paging system" could be a solution. These systems read large data sets in "chunks" by breaking the files into pieces. Once written, they generally "just work" and you hopefully won't have to tinker with them anymore.
Reading Large Files
Variable Length Data in File--Paging
New Link below with very good answer.
Handling Files greater than 2 GB
Search term: "file paging lang:C++" add large or above 2GB for more. HTH
Not sure if you are hitting it or not, but if you are using Linux, malloc will typically not fail, and operator new will typically not throw bad_alloc. This is because Linux will overcommit, and instead kill your process when it decides the system doesn't have enough memory, possibly at a page fault.
See: Google search for "oom killer".
You can disable this behavior with:
echo 2 > /proc/sys/vm/overcommit_memory
Upgrade to a 64-bit CPU, 64-bit OS and 64-bit compiler, and make sure you have plenty of RAM.
A 32-bit app is restricted to 2GB of memory (regardless of how much physical RAM you have). This is because a 32-bit pointer can address 2^32 bytes == 4GB of virtual memory. 20 years ago this seemed like a huge amount of memory, so the original OS designers allocated 2GB to the running application and reserved 2GB for use by the OS. There are various tricks you can do to access more than 2GB, but they're complex. It's probably easier to upgrade to 64-bit.

Which one is faster, reading from disk or allocating system memory

My environment is XP 32-bit. I find that when the allocated memory is nearly at the 2 GB maximum, meaning little virtual address space is left, allocating new memory is very slow.
Say I have a page file that my app needs to analyze.
I have two ways. One is to read it all into system memory, then do the analysis.
The other is to reserve a memory buffer first as a cache, read part of the file into that buffer, analyze it and then discard it, then read the second part of the file, overwriting the cache, and do the analysis again.
From profiling, the second one looks faster, since it avoids the allocation cost.
What do you think? Thanks in advance.
(1) I'm not sure the question matches the title. If you're allocating close to 2GB of RAM on 32-bit Windows, the system is probably paging a lot of memory to disk, and that's where I'd look first for the slowdown. When you're using a lot of memory, you should think of it as being stored on disk (in pagefile.sys) but cached in physical RAM. The second approach might be faster not because of the cost of doing the allocation, but because of the cost of using a lot of memory at once. In effect, when you copy the file into one big allocation you're copying much of it disk->disk via RAM, and then when you run over it again to analyse it, you're loading the copy back into RAM again. If your analysis is a single-pass algorithm, that's a lot of redundant work.
(2) What I think is, mmap the file (MapViewOfFile and friends on Windows).
Edit: (3) a caution. If the file is currently 1.8GB, there might be a chance that next year it might be 4GB. If so, I'd plan now for it to have a size greater than 2^32 on a 32bit machine, which means either taking your second option, or else still using MapViewOfFile but doing it one sensible-sized chunk of the file at a time, rather than all at once. Otherwise you'll be revisiting this code the first time someone tries it on a big file and reports the bug.
You forget a third way: map the file into memory, see CreateFileMapping/MapViewOfFile.
This is the fastest way.
Your best bet is to use the Windows MapViewOfFile and similar functions (the Windows equivalent of mmap). This will allow the operating system to manage the paging in of the various parts of the file.
Why is the amount of allocated memory so high? If memory allocations take a reasonable amount of time, then you will find doing it in memory is far, far quicker - my approach would be to do it in memory, and try to find a way to reduce the memory usage to the point where it's quick again.
As I see the situation, you either manage the paging yourself or let the operating system manage it for you. In most cases I would suggest letting the operating system handle the paging (use virtual memory). Since I have a distrust of MS operating systems, I cannot wholeheartedly recommend this technique, although your mileage may vary.