As far as I understand, the CCSpriteFrameCache gets filled when creating sprites from files and spritesheets, and it should be clever enough to purge unused frames when memory runs low.
My question is: in my app should I worry about manually releasing unused frames as soon as possible or should I just rely on the cache to purge itself when necessary?
Is there any side effect (such as an overall performance hit to the system or to other running apps) in letting the cache grow until a memory warning is received?
The answer to that truly depends on your app, and its memory footprint. In one of my games, I aggressively purge memory, and load 'just in time', when the game is in a circumstance where the reload will not add undue lags that could be a turn-off for the user.
But as I said, your strategy should be based on actual measurements of the memory footprint and of its 'perceived' impact on user experience. As always, start your measurements on the simulator (OK for measuring memory, but NOT OK for measuring time and FPS), but quickly validate with some measurements on real devices.
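For illustration, here is roughly what "aggressive purging" can look like. This is only a minimal sketch assuming the cocos2d-x 2.x C++ API (the Objective-C selectors in cocos2d-iphone are named the same way); the helper name is made up, and you would call it at a "safe" moment such as right after a scene transition:

#include "cocos2d.h"

// Hypothetical helper: drop everything the caches hold that no live node references.
void purgeUnusedGraphics()
{
    // Remove sprite frames that no CCSprite currently uses.
    cocos2d::CCSpriteFrameCache::sharedSpriteFrameCache()->removeUnusedSpriteFrames();
    // Remove the backing textures that are no longer referenced either.
    cocos2d::CCTextureCache::sharedTextureCache()->removeUnusedTextures();
}

Note the order: sprite frames hold references to their textures, so purging the frames first lets the texture cache actually release memory.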
Related
I'm a recent convert to C++ for game programming - I have a lot of experience dealing with memory management and garbage collection woes in C#, but not as much with C++.
I've heard some vague advice in the past to avoid allocations and deallocations during gameplay (i.e. new and delete) and to pre-allocate everything you might need up front. But that's a lot more tedious and architecturally complex than just allocating and freeing game objects as needed while the game's running (enemies, particles, etc.).
I think the advice I read was referring to resource-constrained platforms - I'm aiming to develop mostly for PC, and I imagine the game state data that would be changing frequently would be on the order of a few megabytes at most. The rest are textures, sound assets, etc. that I'll be preloading.
So my question is: in a world of PCs with gigabytes of memory, is it worth the headache to set up elaborate memory pooling, pre-allocation, and so forth for my game state data? Or is this just some unquestioned "best practice" tradition that evolved when maxing out limited platforms, that is now repeated as gospel?
If my 2 MB of game data gets fragmented and is now spread over 4MB, I can't imagine that mattering in the slightest on a PC made after 1990 - but curious to know if I'm missing something :).
The main reasons to avoid calling new unless necessary in a game environment are:
Dynamic allocation of memory really is surprisingly expensive.
Cache misses are detrimental to performance.
Dynamic Allocation
At my work, we develop a game-like product (virtual surgery), and most of our memory is pre-allocated and handled via factories or memory pools. This is done because dynamically allocating memory takes a very long time: the system has to deal with memory requests of many different sizes at any and all times, which means a lot of work goes into processes such as minimizing fragmentation. If you ask the system for memory, you'll have to wait for it to do that work. If you pre-allocate memory, you can use factories or other block-size-specific memory managers to alleviate these concerns.
I can tell you from experience that a simple mistake like allocating a reasonably large std::vector from scratch every frame, instead of reusing pre-allocated memory, can drag the frame rate down into the gutter.
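To make that concrete, here is a minimal sketch (the names are invented) of the "reuse pre-allocated memory" pattern for a per-frame buffer; clear() keeps the vector's capacity, so the allocation cost is paid rarely instead of on every frame:

#include <cstddef>
#include <vector>

struct Particle { float x, y, vx, vy; };

// Allocated once and reused every frame.
static std::vector<Particle> g_frameParticles;

void buildFrameParticles(std::size_t count)
{
    g_frameParticles.clear();                 // keeps capacity, frees nothing
    if (g_frameParticles.capacity() < count)
        g_frameParticles.reserve(count);      // grows occasionally, not per frame
    for (std::size_t i = 0; i < count; ++i)
        g_frameParticles.push_back(Particle{ 0.f, 0.f, 0.f, 0.f });
}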
Cache Misses
Another related issue is cache locality. Cache misses, which force the CPU to fetch data from main memory (or, worse, trigger a page fault so the OS has to bring the page back in), are also very expensive. If this happens often, you'll have an unplayable game. If, however, you pre-allocate large chunks of memory, that goes a long way towards improving cache locality, which makes cache misses few and far between.
Moral of the story
So, in short: If you don't manage your own pre-allocated memory, you can expect a lot of your computational time to be lost to waiting for the system to allocate memory or handle cache misses.
in a world of PCs with gigabytes of memory
And only a few megabytes for your process if a sane OS is running on that PC. (But that's not the primary issue here anyway.)
is it worth the headache to set up elaborate memory pooling, pre-allocation, and so forth for my game state data
I don't dare say "always preallocate everything", because sometimes that's just not possible, and for less frequently used objects it might not be worth the effort; but it is certainly true that dynamic memory management in C and C++ is resource-intensive and slow. So if you render, for example, 30 frames per second to the screen, then avoid allocating, de-allocating and re-allocating the buffer for the objects to be rendered on every frame/iteration.
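As a sketch of what such reuse can look like beyond a single buffer, here is a tiny fixed-capacity pool (all names invented) that hands out slots from memory allocated once up front, so nothing inside the frame loop touches new/delete:

#include <cstddef>
#include <vector>

template <typename T, std::size_t Capacity>
class FixedPool {
public:
    FixedPool()
    {
        freeSlots_.reserve(Capacity);
        for (std::size_t i = 0; i < Capacity; ++i)
            freeSlots_.push_back(Capacity - 1 - i);   // indices of unused slots
    }

    T* acquire()
    {
        if (freeSlots_.empty())
            return nullptr;                           // pool exhausted; size it generously
        std::size_t idx = freeSlots_.back();
        freeSlots_.pop_back();
        return &slots_[idx];
    }

    void release(T* obj)
    {
        freeSlots_.push_back(static_cast<std::size_t>(obj - slots_));
    }

private:
    T slots_[Capacity];                  // storage lives for the pool's whole lifetime
    std::vector<std::size_t> freeSlots_; // indices into slots_
};

A renderer could acquire() its per-frame objects from a pool like this and release() them at the end of the frame, paying the allocation cost exactly once at start-up.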
If I use the following call in C++, I would expect the WorkingSet of the process to never drop below 100MB.
However, the OS still trims the working set back to 16MB, even if I make this call.
Setting the WorkingSet to 100MB would lead to a dramatic increase in my application speed, by eliminating soft page faults (see the diagram below).
What am I doing wrong?
#include <windows.h>

SIZE_T workingSetSizeMB = 100;
BOOL ok = SetProcessWorkingSetSizeEx(
    GetCurrentProcess(),
    (workingSetSizeMB - 1) * 1024 * 1024, // dwMinimumWorkingSetSize
    workingSetSizeMB * 1024 * 1024,       // dwMaximumWorkingSetSize
    QUOTA_LIMITS_HARDWS_MIN_ENABLE | QUOTA_LIMITS_HARDWS_MAX_DISABLE
);
// ok is non-zero (TRUE), so the call succeeded.
(extra for experts) Experimental Methodology
I wrote a test C++ project to allocate 100MB of data to bring the WorkingSet over 100MB (as viewed within Process Explorer), then deallocated that memory. However, the OS trimmed the WorkingSet back to 16MB as soon as I deallocated that memory. I can provide the test C++ project I used if you wish.
Why is Windows providing a call to SetProcessWorkingSetSizeEx() if it doesn't appear to work? I must be doing something wrong.
The diagram below shows the dramatic increase in the number of soft page faults (the red spikes) when the green line (the working set) dropped from 50MB to 30MB.
Update
In the end, we ended up ignoring the problem, as it didn't impact performance that much.
More importantly, SetProcessWorkingSetSizeEx does not control the current WorkingSet, and it is not related in any way to soft page faults. All it does is prevent hard page faults, by preventing the current WorkingSet from being paged out to the hard drive.
In other words, if one wants to reduce soft page faults, SetProcessWorkingSetSizeEx has absolutely no effect, as it only concerns hard page faults.
There is a great write-up in "Windows via C/C++" (Richter) which describes how Windows deals with memory.
Page faults are cheap and are to be expected. Real-time applications, high-end games, high-intensity processing and BluRay playback all happily work at full-speed with page-faults. Page faults are not the reason your application is slow.
To find out why your application is slow, you need to profile it.
To specifically answer your question - the page faults that occur when you've just had a GC.Collect() aren't page-in faults; they're demand-zero page faults caused by the fact that the GC has just allocated a huge new block of demand-zero pages to move your objects to. Demand-zero pages aren't serviced from your pagefile and incur no disk cost, but they are still page faults, hence why they show up on your graph.
As a general rule, Windows is better at managing your system resources than you are, and its defaults are highly tuned for the average case of normal programs. It is clear from your example that you are using a garbage collector, and hence you've already offloaded the task of dealing with working sets, virtual memory and so on to the GC implementation. If SetProcessWorkingSetSize were a good API call to improve GC performance, the GC implementation would do it.
My advice to you is to profile your app. The main cause of slowdown in managed applications is writing bad managed code - not the GC slowing you down. Improve the big-O performance of your algorithms, offload expensive work through the use of things like Future and BackgroundWorker and try to avoid doing synchronous requests to the network - but above all, the key to getting your app fast is to profile it.
My application buffers data for likely requests in the background. Currently I limit the size of the buffer based on a command-line parameter, and begin dumping less-used data when we hit this limit. This is not ideal because it relies on the user to specify a performance-critical parameter. Is there a better way to handle this? Is there a way to automatically monitor system memory use and dump the oldest/least-recently-used data before the system starts to thrash?
A complicating factor here is that my application runs on Linux, OSX, and Windows. But I'll take a good way to do this on only one platform over nothing.
Your best bet would likely be to monitor your application's working set/resident set size, and try to react when it doesn't grow after your allocations. Some pointers on what to look for:
Windows: GetProcessMemoryInfo
Linux: /proc/self/statm
OS X: task_info()
Windows also has GlobalMemoryStatusEx which gives you a nice Available Physical Memory figure.
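For what it's worth, a minimal sketch of the Windows side of this (error handling trimmed, and the function name is made up):

#include <windows.h>
#include <psapi.h>   // GetProcessMemoryInfo; link against psapi.lib on older toolchains
#include <cstdio>

void reportMemory()
{
    // Per-process figure: the current working set size.
    PROCESS_MEMORY_COUNTERS pmc = {};
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        std::printf("Working set: %zu bytes\n", pmc.WorkingSetSize);

    // System-wide figure: physical memory still available.
    MEMORYSTATUSEX status = {};
    status.dwLength = sizeof(status);
    if (GlobalMemoryStatusEx(&status))
        std::printf("Available physical memory: %llu bytes\n",
                    static_cast<unsigned long long>(status.ullAvailPhys));
}

On Linux the equivalent would be parsing the resident-set field out of /proc/self/statm, and on OS X it would be task_info() with TASK_BASIC_INFO.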
I like your current solution. Letting the user decide is good. It's not obvious everyone would want the buffer to be as big as possible, is it? If you do invest in implementing some sort of memory monitor for automatically adjusting the buffer/cache size, at least let the user choose between the user-set limit and the automatic/dynamic one.
I know this isn't a direct answer, but I'd say step back a bit and maybe don't do this.
Even if you have the API to see current physical memory usage, that's not enough to choose an ideal cache size. That would depend on your typical and future workloads for both the program and the machine (and the overall system of all clients running this program + the server(s) they're querying), the platform's caching behavior, whether the system should be tuned for throughput or latency, and so on. In a tight memory situation, you're going to be competing for memory with other opportunistic caches, including the OS's disk cache. On the one hand, you want to be exerting some pressure on them, to force out other low-value data. On the other hand, if you get greedy while there's plenty of memory, you're going to be affecting the behavior of other adaptive caches.
And with speculative caching/prefetching, the LRU value function is odd: you will (hopefully) fetch the most-likely-to-be-called data first, and less-likely data later, so the LRU data in your prefetch cache may be less valuable than older data. This could lead to perverse behavior in the systemwide set of caches by artificially "heating up" less commonly used data.
It seems unlikely that your program would be able to make a cache size choice better than a simple fixed size, perhaps scaled based on the size of overall physical memory on the machine. And there's very little chance it could beat a sysadmin who knows the machine's typical workload and its performance goals.
Using an adaptive cache sizing strategy means that your program's resource usage is going to be both variable and unpredictable. (With respect to both memory and the I/O and server requests used to populate that prefetch cache.) For a lot of server situations, that's not good. (Especially in HPC or DB servers, which this sounds like it might be for, or a high-utilization/high-throughput environment.) Consistency, configurability, and availability are often more important than maximum resource utilization. And locality of reference tends to fall off quickly, so you're likely getting very diminishing returns with larger cache sizes. If this is going to be used server-side, at least leave the option for explicit control of cache sizes, and probably make that the default, if not only, option.
There is a way: it is called virtual memory (VM). All three operating systems listed use virtual memory unless there is no hardware support (which may be true in embedded systems), so I will assume that VM support is present.
Here is a quote from the architecture notes of the Varnish project:
The really short answer is that computers do not have two kinds of storage any more.
I would suggest you read the full text here: http://www.varnish-cache.org/trac/wiki/ArchitectNotes
It is a good read, and I believe will answer your question.
You could attempt to allocate some large-ish block of memory and then check for a memory allocation exception. If the exception occurs, dump data. The problem is that this will only work once all system memory (or the process limit) is exhausted, which means your application is likely to start swapping.
#include <new>  // std::bad_alloc

try {
    char *buf = new char[10 * 1024 * 1024]; // 10 megabytes
    delete[] buf;                           // memory from new[] must be released with delete[], not free()
} catch (const std::bad_alloc &) {
    // Allocation failed - clean up old buffers here
}
The problems with this approach are:
Running out of system memory can be dangerous and cause random applications to be shut down
Better memory management might be a better solution. If there is data that can be freed, why has it not already been freed? Is there a periodic process you could run to clean up unneeded data?
I'm working on what is essentially the runtime for a large administrative application. The actual logic that is being executed, as well as the screens being shown and the data operated upon is stored in a central database. In order to improve performance, the runtime keeps data queried from the database in various caches.
However, it is not always clear how these caches should be managed. Currently, some caches are flushed whenever the runtime goes idle, whereas other caches are never flushed, or only flushed if some configurable but arbitrary limit is reached. We'd obviously want to keep as much data as possible in memory, yet I'm unsure how to do this in a way that plays nicely with Citrix, something that's very important to our customers.
I've been looking into using a resource notification (CreateMemoryResourceNotification()) and flushing caches if it signals that memory is running low, but I'm afraid that using just that would make things behave very badly when running 20+ instances under Citrix, with one instance gobbling up all memory and the rest constantly flushing their caches.
I could set hard limits on cache size with CreateJobObject() and friends, but that could cause the runtime to fail with out-of-memory errors should an instance have a legitimate need for a lot of memory.
I could prevent such problems by using a separate heap for cached data, but there's not a clear separation between cached and non-cached data, so that seems awfully fragile.
TL;DR: anyone got any good ideas for managing in-memory caches under Windows?
Can't you make a hybrid solution of some kind, so that the runtime tries to keep its cache limited to a fixed size, but with the possibility to grow bigger if there is a legitimate need to do so and then try to shrink the cache to a reasonable size if the occasion is there?
Preventing one instance from gobbling up all memory while the others are repeatedly flushing their caches can maybe be avoided by distributing the memory resource notification to all instances when it arrives. This way they all take a good look at their caches when one instance gets the notification.
And last, of course sometimes a trade-off between performance and memory usage has to be made. Here again, if the instances can communicate in some way, they may be able to adjust their maximum cache size based on the number of instances and the amount of memory available on the machine they run on. This way, if more instances are started, they all give in a little bit to accommodate the newcomer, without the risk of overloading the memory of the server.
What strategy are you going to use to determine what needs to be cached? Are you going to keep a last-used timestamp and flush old items when room needs to be made for new ones?
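To illustrate the last-used idea, here is a bare-bones LRU sketch (types and names are purely illustrative): recently touched keys sit at the front of a list, and when the cache is over budget the entry at the back, the least recently used one, gets evicted.

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class LruCache {
public:
    explicit LruCache(std::size_t maxEntries) : maxEntries_(maxEntries) {}

    void put(const std::string& key, std::string value)
    {
        auto it = index_.find(key);
        if (it != index_.end())
            order_.erase(it->second);                 // drop the stale position
        order_.push_front({key, std::move(value)});   // newest entry goes to the front
        index_[key] = order_.begin();
        if (order_.size() > maxEntries_) {            // over budget: evict the coldest entry
            index_.erase(order_.back().first);
            order_.pop_back();
        }
    }

    const std::string* get(const std::string& key)
    {
        auto it = index_.find(key);
        if (it == index_.end())
            return nullptr;
        order_.splice(order_.begin(), order_, it->second);  // mark as most recently used
        return &it->second->second;
    }

private:
    using Entry = std::pair<std::string, std::string>;
    std::list<Entry> order_;                                            // front = hottest
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
    std::size_t maxEntries_;
};

In a real cache you would probably evict against a byte budget rather than an entry count, but the bookkeeping is the same.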
I'm writing a performance-critical application where it's essential to store as much data as possible in physical memory before dumping to disk.
I can use ::GlobalMemoryStatusEx(...) and ::GetProcessMemoryInfo(...) to find out what percentage of physical memory is reserved/free and how much memory my current process handles.
Using this data, I can make sure to dump when ~90% of the physical memory is in use, or when ~90% of the 2GB-per-application limit is hit.
However, I would like a method for simply retrieving how many bytes are actually left before the system will start using virtual memory, especially as the application will be compiled for both 32-bit and 64-bit, and the 2GB limit doesn't exist for the latter.
How about this function:
int bytesLeftUntilVMUsed() {
    return 0;
}
It should give the correct result in nearly all cases, I think ;)
Imagine running Windows 7 in 256MB of RAM (MS suggests 1GB minimum). That's effectively what you're asking the user to do by wanting to reserve 90% of available RAM.
The real question is: Why do you need so much RAM? What is the 'performance critical' criteria exactly?
Usually, this kind of question implies there's something horribly wrong with your design.
Update:
Using top-of-the-range RAM (DDR3) would give you a theoretical transfer speed of 12GB/s, which equates to reading one 32-bit value every clock cycle with some bandwidth to spare. I'm fairly sure that it is not possible to do anything useful with data coming into the CPU at that speed - instruction processing stalls would interrupt this flow. The extra, unused bandwidth can be used to page data to/from a hard disk. Using RAID, this transfer rate can be quite high (about 1/16th of the RAM bandwidth). So it would be feasible to transfer data to/from the disk and process it without any degradation of performance - 16 cycles between reads is all it would take (OK, my maths might be a bit wrong here).
But if you throw Windows into the mix, it all goes to pot. Your memory can go away at any moment, your application can be paused arbitrarily, and so on. Locking memory to RAM would have adverse effects on the whole system, thus defeating the purpose of locking the memory.
If you explain what you're trying to achieve and the performance criteria, there are many people here who will help develop a suitable solution, because if you have to ask about system limits, you really are doing something wrong.
Even if you're able to stop your application from having memory paged out to disk, you'll still run into the problem that the VMM might be paging other programs out to disk, and that might affect your performance as well. Not to mention that another application might start up and consume memory that you're currently occupying, resulting in some of your application's memory being paged out. How are you planning to deal with that?
There is a way to use non-pageable memory via the non-paged pool, but (a) this pool is comparatively small and (b) it's used by device drivers and might only be usable from inside the kernel. It's also not recommended to use large chunks of it unless you want to risk destabilizing your system.
You might want to revisit the design of your application and try to work around the possibility of having memory paged to disk before you either try to write your own VMM or turn a Windows machine into essentially a DOS box with more memory.
The standard solution is to not worry about "virtual" and worry about "dynamic".
The "virtual" part of virtual memory has to be looked at as a hardware function that you can only defeat by writing your own OS.
The dynamic allocation of objects, however, is simply your application program's design.
Statically allocate simple arrays of the objects you'll need. Use those arrays of objects. Increase and decrease the size of those statically allocated arrays until you have performance problems.
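A minimal sketch of that approach (all names invented): every slot exists for the whole run, "acquiring" just marks a slot active, and nothing on the hot path touches the heap.

#include <cstddef>

struct Record { double payload[16]; bool active; };

constexpr std::size_t kMaxRecords = 4096;   // tune this constant, not the allocator
static Record g_records[kMaxRecords] = {};  // allocated once, up front

Record* acquireRecord()
{
    for (std::size_t i = 0; i < kMaxRecords; ++i) {
        if (!g_records[i].active) {
            g_records[i].active = true;
            return &g_records[i];
        }
    }
    return nullptr;  // array full: raise kMaxRecords rather than falling back to new
}

void releaseRecord(Record* r) { r->active = false; }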
Ouch. Non-paged pool (the amount of RAM which cannot be swapped or allocated to processes) is typically 256 MB. That's 12.5% of RAM on a 2GB machine. If another 90% of physical RAM were allocated to a process, that would leave -2.5% for all other applications, services, the kernel and drivers. Even if you allocated only 85% for your app, that would still leave only 2.5% = 51 MB.