Caching strategies for Windows end-user applications? (C++)

I'm working on what is essentially the runtime for a large administrative application. The actual logic that is being executed, as well as the screens being shown and the data operated upon is stored in a central database. In order to improve performance, the runtime keeps data queried from the database in various caches.
However, it is not always clear how these caches should be managed. Currently, some caches are flushed whenever the runtime goes idle, whereas other caches are never flushed, or only flushed if some configurable but arbitrary limit is reached. We'd obviously want to keep as much data as possible in memory, yet I'm unsure how to do this in a way that plays nicely with Citrix, something that's very important to our customers.
I've been looking into using a resource notification (CreateMemoryResourceNotification()) and flushing caches if it signals that memory is running low, but I'm afraid that using just that would make things behave very badly when running 20+ instances under Citrix, with one instance gobbling up all memory and the rest constantly flushing their caches.
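For reference, the notification-based approach described here could look roughly like the sketch below. CacheMaintenanceLoop and FlushCaches are hypothetical names standing in for the runtime's own cache-eviction logic:

#include <windows.h>

// Stand-in for the runtime's existing eviction logic.
static void FlushCaches()
{
    // ... evict least-recently-used cache entries here ...
}

void CacheMaintenanceLoop()
{
    // Signaled by Windows when available physical memory is running low.
    HANDLE lowMem = CreateMemoryResourceNotification(LowMemoryResourceNotification);
    if (lowMem == NULL)
        return; // fall back to fixed cache limits

    for (;;)
    {
        // Wake when the notification fires, or every 30 seconds to re-check.
        DWORD wait = WaitForSingleObject(lowMem, 30 * 1000);
        BOOL low = FALSE;
        if (wait == WAIT_OBJECT_0 ||
            (QueryMemoryResourceNotification(lowMem, &low) && low))
        {
            FlushCaches();
        }
    }
}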
I could set hard limits on cache size with CreateJobObject() and friends, but that could cause the runtime to fail with out-of-memory errors should an instance have a legitimate need for a lot of memory.
I could prevent such problems by using a separate heap for cached data, but there's not a clear separation between cached and non-cached data, so that seems awfully fragile.
TL;DR: anyone got any good ideas for managing in-memory caches under Windows?

Can't you use some kind of hybrid solution, where the runtime tries to keep its cache within a fixed size, but is allowed to grow beyond it when there is a legitimate need, and then shrinks back to a reasonable size when the opportunity arises?
Preventing one instance from gobbling up all memory while the others repeatedly flush their caches could perhaps be avoided by distributing the memory resource notification to all instances when it arrives. That way they all take a good look at their caches when one instance gets the notification.
And lastly, of course a trade-off between performance and memory usage sometimes has to be made. Here again, if the instances can communicate in some way, they may be able to adjust their maximum cache size based on the number of instances and the amount of memory available on the machine they run on. That way, if more instances are started, they all give in a little bit to accommodate the newcomer, without the risk of overloading the memory of the server.
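As a rough illustration of that last point, each instance could derive its own cache budget from the machine's physical memory and the current instance count. This is only a sketch; PerInstanceCacheBudget is a hypothetical name, and how the instances discover each other (shared memory, a named kernel object, process enumeration) is left open:

#include <algorithm>
#include <cstddef>
#include <windows.h>

// Hypothetical helper: derive a per-instance cache budget from total physical
// memory and the number of running runtime instances.
std::size_t PerInstanceCacheBudget(std::size_t instanceCount)
{
    MEMORYSTATUSEX ms = {};
    ms.dwLength = sizeof(ms);
    if (!GlobalMemoryStatusEx(&ms))
        return 64u * 1024 * 1024; // conservative fallback: 64 MB

    // Leave roughly half of physical memory to the OS, Citrix, and everything
    // else, and split the remainder evenly among the instances.
    unsigned long long shareable = ms.ullTotalPhys / 2;
    instanceCount = std::max<std::size_t>(instanceCount, 1);
    return static_cast<std::size_t>(shareable / instanceCount);
}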
What strategy are you going to use to determine what needs to be cached? Are you going to keep a last-used timestamp and flush old items when room needs to be made for new ones?
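If eviction by last use is the direction, a minimal sketch of such a cache might look like the following. String keys and values and the class name LruCache are placeholders for the runtime's real cached objects:

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Minimal least-recently-used cache: entries move to the front on each access,
// and the oldest entries at the back are evicted when the cache must shrink.
class LruCache {
public:
    explicit LruCache(std::size_t maxEntries) : maxEntries_(maxEntries) {}

    void put(const std::string& key, std::string value) {
        auto it = index_.find(key);
        if (it != index_.end())
            entries_.erase(it->second);
        entries_.push_front({key, std::move(value)});
        index_[key] = entries_.begin();
        shrinkTo(maxEntries_);
    }

    const std::string* get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end())
            return nullptr;
        // Accessed entries become the most recently used.
        entries_.splice(entries_.begin(), entries_, it->second);
        return &it->second->second;
    }

    // Can also be called from a low-memory handler to give memory back.
    void shrinkTo(std::size_t entries) {
        while (entries_.size() > entries) {
            index_.erase(entries_.back().first);
            entries_.pop_back();
        }
    }

private:
    using Entry = std::pair<std::string, std::string>;
    std::size_t maxEntries_;
    std::list<Entry> entries_;
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};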

Related

Preloading data into RAM for fast transactions

My thinking is that if we preload clients' data (account number, net balance) in advance, then whenever a transaction is processed the transaction record is written into RAM in a FIFO data structure, and the clients' data in RAM is updated as well. After a certain period the records are written to the database on disk, to prevent data loss caused by RAM's volatility.
By doing so, I/O time should be saved, and less time is spent looking up clients' data, which achieves the aim (faster transactions).
I have heard about in-memory databases, but I do not know whether my idea is the same as that. Also, is there a better idea than what I am thinking of?
In my opinion, there are several aspects to think about and research to get a step forward. Preloading and working on data is usually faster than being bound to disk/database page access schemes. However, you instantly lose durability. Therefore, three approaches are valid in different situations:
disk-synchronous (good old database way, after each transaction data is guaranteed to be in permanent storage)
in-memory (good as long as the system is up and running, faster by orders of magnitude, risk of losing transaction data on errors)
delayed (basically in-memory, but from time to time data is flushed to disk)
It is worth noting that the delayed approach is directly supported on Linux through memory-mapped files, which are, on the one hand, often as fast as regular memory (unless you read and touch too many pages) and, on the other hand, synced to disk automatically (though not instantly).
As you tagged C++, this is possibly the simplest way of getting your idea running.
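A minimal sketch of that memory-mapped-file approach, assuming a made-up file name and record layout, could look like this:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Assumed record layout for the example.
struct Account {
    long long number;
    long long netBalance;
};

int main() {
    const std::size_t kRecords = 100000;
    const std::size_t kBytes = kRecords * sizeof(Account);

    int fd = open("accounts.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, kBytes) != 0) return 1;

    // MAP_SHARED: the kernel writes dirty pages back to the file eventually;
    // msync() forces a flush at a point of your own choosing.
    void* mem = mmap(nullptr, kBytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) return 1;
    Account* accounts = static_cast<Account*>(mem);

    accounts[42].netBalance += 100;    // updates happen at memory speed

    msync(accounts, kBytes, MS_ASYNC); // request a (non-blocking) writeback

    munmap(accounts, kBytes);
    close(fd);
    return 0;
}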
Note, however, that once you assume failures (hardware, reboot, etc.) you won't really have transactions at all, because it is non-trivial to tell exactly when the data has actually been written.
As a side note: sometimes this problem is solved by writing (reliably) to a log file (sequential access, therefore faster than writing directly to the data files). Search for the word Compaction in the context of databases: this is the operation that merges a log with the regular on-disk data structures and happens from time to time (when the log gets too large).
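As a rough sketch of that log-first idea (the record format and the function name AppendToLog are invented for the example), appending sequentially and syncing before acknowledging a transaction might look like this:

#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Append one transaction record to the log and sync it before acknowledging.
// Sequential appends plus fsync give durability without random I/O; a later
// compaction step folds the log into the main on-disk data.
bool AppendToLog(int logFd, long long account, long long delta) {
    char record[64];
    int len = std::snprintf(record, sizeof(record), "%lld %lld\n", account, delta);
    if (len < 0 || write(logFd, record, len) != len)
        return false;
    return fsync(logFd) == 0;
}

int main() {
    int logFd = open("txn.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (logFd < 0) return 1;
    AppendToLog(logFd, 12345, -250);
    close(logFd);
    return 0;
}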
To the last aspect of the question: Yes, in-memory databases work in main memory. Still, depending on the guarantees (ACID?) they give, some operations still involve hard disk or NVRAM.

Is it good idea to store operational data in memcached?

I'm writing a data processor in C++ which has to process a lot of requests and do a lot of calculations; the requests are connected with each other. Now I'm thinking about easy horizontal scalability.
Is it a good idea to use memcached with replication (an instance on every processor) to store operational data? That way every processor instance could process any request in roughly the same time.
How fast and stable is memcached replication?
Very fast. One major potential shortcoming of memcached is that it is not persistent. While a common design consideration when using a cache layer is that “data in cache may go away at any point”, this can result in painful warmup time and/or costly cache stampedes.
I would check out Couchbase. http://www.couchbase.com/ It stores the cached data in RAM, but also flushes it out to disk periodically so if a machine gets restarted, the data is still there.
It's very easy to add nodes on the fly as well.
Just for fun you could also check out Riak: http://basho.com/riak/. Very easy to add nodes as your cache needs grow and very easy to get up and running. Also focused on key/value storage, which is good for caching objects.

Making more efficient use fork() and copy-on-write memory sharing

I am a programmer developing a multiplayer online game using Linux based servers. We use an "instanced" architecture for our world. This means that each player entering a world area gets a copy of that area to play in with their party members, and independent of all the other players playing in the same area.
Internally we use a separate process for each instance. Initially each instance process would start up, load only the resources required for the given area, generate its random terrain, and then allow new connections from players. The amount of memory used by an instance was typically about 25 meg including resources and the randomly generated level with entities.
In order to reduce the memory footprint of instances, and to speed up the spawn time, we changed to an approach where we create a single master instance that loads all the resources that any instance could possibly need (about 150 meg of memory) and then, when a new instance is required, use the fork() function to spawn a new instance and utilise copy-on-write memory sharing so that the new instance only requires memory for its "unique" data set. The footprint of the randomly generated level and entities which make up the unique data for each instance is about 3-4 meg of memory.
Unfortunately the memory sharing is not working as well as I think it could. A lot of memory pages seem to become unshared.
At first, as we load more of our data set in the prefork instance, the memory required for each forked instance goes down, but eventually there is an inflection point where loading more assets in the prefork actually increases the data used by each forked instance.
The best results we have had are loading about 80 meg of the data set pre fork, and then having the fresh instances demand load the rest. This results in about 7-10 extra meg per instance and an 80 meg fixed cost. Certainly a good improvement, but not the theoretical best.
If I load the entire 150 meg data set and then fork, each forked instance uses about 50 more meg of memory! Significantly worse than simply doing nothing.
My question is: how can I load all of my data set in the prefork instance, and make sure that each forked instance's memory footprint consists only of the minimal set of genuinely unique per-instance data?
I have a theory as to what is happening here and I was wondering if someone would be able to help confirm for me that this is the case.
I think it's to do with the malloc free chain. Each memory page of the prefork instance probably has a few free spots of memory left in it. If, during random level generation, something is allocated that happens to fit in one of the free spots in a page, then that entire page would be copied into the forked process.
In Windows you can create alternate heaps, and change the default heap used by the process. If this were possible, it would remove the problem. Is there any way to do such a thing in Linux? My investigations seem to indicate that you can't.
Another possible solution would be if I could somehow discard the existing malloc free chain, forcing malloc to allocate fresh memory from the operating system for subsequent calls. I attempted to look at the implementation of malloc to see if this would be easily possible, but it seemed like it might be somewhat complex. If anyone has any ideas around this area or a suggestion of where to start with this approach, I'd love to hear it.
And finally if anyone has any other ideas about what might be going wrong here, I'd really like to hear them. Thanks a lot!
In Windows you can create alternate heaps, and change the default heap used by the process. If this were possible, it would remove the problem. Is there any way to do such a thing in Linux?
In Unix you can simply mmap(2) memory and bypass malloc altogether.
I would also ditch the whole "rely-on-cow" thing. I would have the master process mmap some memory (80M, 150M whatever), write stuff to it, mark it read-only via mprotect(2) for good measure and take it from there. This would solve the real issue and wouldn't force you to change the code down the road.
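A sketch of that suggestion, with the asset-loading step reduced to a placeholder, might look like the following; the sizes and the loop count are taken from the question and are otherwise arbitrary:

#include <cstddef>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Size of the shared, read-only resource region (~150 MB, per the question).
const std::size_t kRegionSize = 150u * 1024 * 1024;

int main() {
    // Anonymous mapping outside the malloc heap; children share it
    // copy-on-write after fork().
    void* region = mmap(nullptr, kRegionSize, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return 1;

    // ... load/deserialize all shared assets into the region here ...
    std::memset(region, 0, kRegionSize); // placeholder for asset loading

    // Read-only from here on: nothing can dirty (and thus un-share) these pages.
    mprotect(region, kRegionSize, PROT_READ);

    for (int i = 0; i < 20; ++i) {
        if (fork() == 0) {
            // Child instance: generate its unique 3-4 meg level with regular
            // malloc/new, which lives in pages separate from the resource region.
            _exit(0);
        }
    }
    return 0;
}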

memory safety for encrypted, sensitive data

I'm writing a server in C++ that will handle secure connections over which sensitive data will be sent.
The goal is to never save the data in unencrypted form anywhere outside memory, and to keep it in a defined region of memory (to be overwritten once it's no longer needed).
Will allocating a large chunk of memory and using it to store the sensitive data be sufficient to ensure that there is no leakage of data?
From the manual of a tool that handles passwords:
It's also debatable whether mlock() is a proper way to protect sensitive
information. According to POSIX, mlock()-ing a page guarantees that it
is in memory (useful for realtime applications), not that it isn't
in the swap (useful for security applications). Possibly an encrypted
swap partition (or no swap partition) is a better solution.
However, the Linux man page does guarantee that mlock()-ed memory is not written to swap, and specifically discusses security applications. It also mentions:
But be aware that the suspend mode on laptops and some desktop computers will
save a copy of the system's RAM to disk, regardless of memory locks.
Why don't you use SELinux? Then no process can access other processes' data unless you tell it that it can.
I think if you are securing a program handling sensitive data, you should start by using a secure OS. If the OS is not secure enough then there is nothing your application can do to fix that.
And maybe when using SELinux you don't have to do anything special in your application, making your application smaller, simpler, and also more secure?
What you want is locking some region of memory into RAM. See the manpage for mlock(2).
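A minimal sketch of that approach, keeping the secret in its own page-aligned, locked mapping and wiping it before release (the size is arbitrary):

#include <cstddef>
#include <cstring>
#include <sys/mman.h>

int main() {
    const std::size_t kSize = 4096; // one page for the secret material

    // Dedicated, page-aligned mapping for the sensitive buffer.
    void* secret = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (secret == MAP_FAILED) return 1;

    // Lock the pages so they are not written out to swap.
    if (mlock(secret, kSize) != 0) return 1;

    // ... place and use the decrypted data here ...

    // Overwrite before releasing. A plain memset can sometimes be optimized
    // away; explicit_bzero (glibc) or a volatile write loop is more robust.
    std::memset(secret, 0, kSize);

    munlock(secret, kSize);
    munmap(secret, kSize);
    return 0;
}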
Locking the memory (or, if you use Linux, using large pages, since these cannot be paged out) is a good start. All other considerations left aside, this does at least not write plaintext to harddisk in unpredictable ways.
Overwriting memory when no longer needed does not hurt, but is probably useless, because
any pages that are reclaimed and later given to another process will be zeroed out by the operating system anyway (every modern OS does that)
as long as some data is on a computer, you must assume that someone will be able to steal it, one way or the other
there are more exploits in the operating system and in your own code than you are aware of (this happens to the best programmers, and it happens again and again)
There are countless concerns when attempting to prevent someone from stealing sensitive data, and it is by no means an easy endeavour. Encrypting data, trying not to have any obvious exploits, and trying to avoid the most stupid mistakes is as good as you will get. Beyond that, nothing is really safe, because for every N things you plan for, there exists a N+1 thing.
Take my wife's work laptop as a prime example. The intern setting up the machines in their company (at least it's my guess that he's an intern) takes every possible measure and configures everything in paranoia mode to ensure that data on the computer cannot be stolen and that working becomes as much of an ordeal as possible. What you end up with is a BitLocker-protected computer that takes three passwords to even boot up, on which you can do practically nothing, and a screensaver that locks the workstation every time you pick up the phone and forget to shake the mouse. At the same time, this super secure computer has an enabled FireWire port over which everybody can read and write anything in the computer's memory without a password.

Dumping buffered data when memory limit is near

My application buffers data for likely requests in the background. Currently I limit the size of the buffer based on a command-line parameter, and begin dumping less-used data when we hit this limit. This is not ideal because it relies on the user to specify a performance-critical parameter. Is there a better way to handle this? Is there a way to automatically monitor system memory use and dump the oldest/least-recently-used data before the system starts to thrash?
A complicating factor here is that my application runs on Linux, OSX, and Windows. But I'll take a good way to do this on only one platform over nothing.
Your best bet would likely be to monitor your application's working set / resident set size, and try to react when it doesn't grow after your allocations. Some pointers on what to look for:
Windows: GetProcessMemoryInfo
Linux: /proc/self/statm
OS X: task_info()
Windows also has GlobalMemoryStatusEx which gives you a nice Available Physical Memory figure.
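As an illustration of the Linux variant, here is a small sketch that reads the resident set size from /proc/self/statm (the function name ResidentSetBytes is made up; the second field of that file is the RSS in pages):

#include <cstdio>
#include <unistd.h>

// Read this process's resident set size in bytes.
long ResidentSetBytes() {
    long sizePages = 0, residentPages = 0;
    std::FILE* f = std::fopen("/proc/self/statm", "r");
    if (!f) return -1;
    int fields = std::fscanf(f, "%ld %ld", &sizePages, &residentPages);
    std::fclose(f);
    if (fields != 2) return -1;
    return residentPages * sysconf(_SC_PAGESIZE);
}

int main() {
    std::printf("resident set: %ld bytes\n", ResidentSetBytes());
    return 0;
}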
I like your current solution. Letting the user decide is good. It's not obvious everyone would want the buffer to be as big as possible, is it? If you do invest in implementing some sort of memory monitor for automatically adjusting the buffer/cache size, at least let the user choose between the user-set limit and the automatic/dynamic one.
I know this isn't a direct answer, but I'd say step back a bit and maybe don't do this.
Even if you have the API to see current physical memory usage, that's not enough to choose an ideal cache size. That would depend on your typical and future workloads for both the program and the machine (and the overall system of all clients running this program + the server(s) they're querying), the platform's caching behavior, whether the system should be tuned for throughput or latency, and so on. In a tight memory situation, you're going to be competing for memory with other opportunistic caches, including the OS's disk cache. On the one hand, you want to be exerting some pressure on them, to force out other low-value data. On the other hand, if you get greedy while there's plenty of memory, you're going to be affecting the behavior of other adaptive caches.
And with speculative caching/prefetching, the LRU value function is odd: you will (hopefully) fetch the most-likely-to-be-called data first, and less-likely data later, so the LRU data in your prefetch cache may be less valuable than older data. This could lead to perverse behavior in the systemwide set of caches by artificially "heating up" less commonly used data.
It seems unlikely that your program would be able to make a cache size choice better than a simple fixed size, perhaps scaled based on the size of overall physical memory on the machine. And there's very little chance it could beat a sysadmin who knows the machine's typical workload and its performance goals.
Using an adaptive cache sizing strategy means that your program's resource usage is going to be both variable and unpredictable. (With respect to both memory and the I/O and server requests used to populate that prefetch cache.) For a lot of server situations, that's not good. (Especially in HPC or DB servers, which this sounds like it might be for, or a high-utilization/high-throughput environment.) Consistency, configurability, and availability are often more important than maximum resource utilization. And locality of reference tends to fall off quickly, so you're likely getting very diminishing returns with larger cache sizes. If this is going to be used server-side, at least leave the option for explicit control of cache sizes, and probably make that the default, if not only, option.
There is a way: it is called virtual memory (VM). All three operating systems listed will use virtual memory, unless there is no hardware support (which may be true in embedded systems). So I will assume that VM support is present.
Here is a quote from the architecture notes of the Varnish project:
The really short answer is that computers do not have two kinds of storage any more.
I would suggest you read the full text here: http://www.varnish-cache.org/trac/wiki/ArchitectNotes
It is a good read, and I believe will answer your question.
You could attempt to allocate some large-ish block of memory then check for a memory allocation exception. If the exception occurs, dump data. The problem is, this will only work when all system memory (or process limit) is reached. This means your application is likely to start swapping.
try {
    // Probe: try to allocate a 10 MB block to see whether memory is still available.
    char *buf = new char[10 * 1024 * 1024]; // 10 megabytes
    delete[] buf; // memory from new[] must be released with delete[], not free()
} catch (const std::bad_alloc &) {
    // Memory allocation failed - clean up old buffers
}
The problems with this approach are:
Running out of system memory can be dangerous and cause random applications to be shut down
Better memory management might be a better solution. If there is data that can be freed, why has it not already been freed? Is there a periodic process you could run to clean up unneeded data?