Currently I'm working around the memory limit per process, which brought me to shared memory. I'm developing on Windows 7 with Visual Studio; the software will run on a modern Windows Server system with multiple CPUs and a huge amount of memory.
I've read up on the memory limits per process, and I need to access far more memory than a single process allows. So my idea was to create multiple processes and use shared memory.
But is it really a good idea to create a lot of shared memory? And what about performance?
The limits on memory per process are for virtual memory. This basically means that your address space has a maximum size (e.g. 4 gigabytes on a system with 32-bit pointers). Since shared memory is a mapping of memory into your address space, there's no way that would get you out of the problem you have.
Keep in mind that if you distribute the memory blocks into multiple processes, you'll eventually reach the limits of physical memory and then system performance will slow to a crawl.
If you really need more memory than your system can grant you, you need to start to persist your data to disk. Memory mapped files can allow you to quickly swap memory blocks in and out of your address space.
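As a rough illustration of that approach on Windows (the file name, view size and offset below are placeholders; in practice the offset must be a multiple of the system's allocation granularity):

```cpp
// Sketch (Windows API): map a 64 MB window of a large data file into the
// address space, touch it, then unmap it so another window can be mapped.
#include <windows.h>
#include <cstdint>

int main() {
    HANDLE file = CreateFileA("data.bin", GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READWRITE, 0, 0, nullptr);
    if (!mapping) { CloseHandle(file); return 1; }

    const std::uint64_t offset = 0;                 // multiple of allocation granularity
    const SIZE_T windowSize = 64 * 1024 * 1024;     // 64 MB view, not the whole file

    void* view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS,
                               static_cast<DWORD>(offset >> 32),
                               static_cast<DWORD>(offset & 0xFFFFFFFF),
                               windowSize);
    if (view) {
        auto* bytes = static_cast<unsigned char*>(view);
        bytes[0] ^= 0xFF;          // example access; dirty pages are written back lazily
        UnmapViewOfFile(view);     // frees the address space, not the data on disk
    }

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```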
@Aurus,
It sounds as though what you need to meet your targets is a custom-engineered solution tailored to specific (albeit under-described) requirements. While Stack Overflow is extremely useful for developers and software engineers seeking professional clarity and programmatic examples, the kind of high-level engineering guidance you're after may be hard to locate here and will likely not give you the specific answers you seek; one could make too many assumptions from your post.
Whatever benefit(s) you may (or may not) gain from staggering quantities of RAM and/or multiple threads on multiple processors is a question best left to those with hard experience building such systems. I have years in the field myself and can confidently say that I lack that specific experience. Honestly, I hope to avoid that eventuality, because high-dollar hardware commonly comes with high-pressure schedules, and those can lead to other issues as well. I'll speculate a tiny bit though -- if only because it costs you nothing...
If your intent is firmly fixed upon Windows platforms, my first-order guess is:
a clustered server environment (many multi-core processors for crunching high numbers of threads, backed by a massive quantity of available RAM)
cutting-edge drive hardware -- if you're seeking to minimize the impact of frequent virtual memory access, you'll likely need to target specific cutting-edge hardware options that let you literally replace spindle drives with something more elegant: DRAM-based solid state drives -- not the trivial type you commonly find in modern iPods and mobile PDAs... I refer to the real deal -- classic solid state drives [one fine example is here -- look under hardware]. Their products are two to three orders of magnitude faster than spindle drives, and far faster than consumer solid state as well (albeit not cheap).
Your goals appear to indicate that cost isn't a great concern, but that's about as good as I can give you while lacking more specific information.
One final bit of advice though, when seeking help from engineers it's best to tell them exactly what you seek to accomplish (the goal(s)). Allow them to provide the options and match the limitations of reality and modern technology to your dilemma as well as your financial targets. More often than not, even with esoteric and eccentric requirements, the best solution is actually a custom 'outside-the-box' engineering solution that also ends up being far cheaper to build / implement than a brute-force approach. To put it another way, help the engineers to help you while noting that the GIGO Principle applies as well.
I sincerely hope something I provided is useful. Good luck.
Related
I am doing a school project for an operating systems class. I have to estimate various overheads - for example, time measurement overhead, context switch overhead, and memory/disk access overhead. In several of these contexts, I am required to estimate what the software component of the overhead will be and what the hardware component will be. Could somebody provide an example of what operations would be characterized as hardware overhead and what operations would be software? Am I correct in assuming that setting up the stack when a function is called is software overhead, because it only involves pointers being moved? And that accessing a block on disk would be hardware overhead? These operations seem simple to characterize. Perhaps somebody could give some other examples to firm up my understanding.
Whoever gave you the assignment should have defined the terms involved, such as software overhead or hardware component. If they did not, you should ask.
It's not as clear cut as it may look. You seem to accept that Accessing a block on disk would be hardware overhead. How about memory, then? Memory is a hardware component, just as a hard drive is. Each memory access requires a measurable, albeit tiny, time. Is that to be factored into the software vs. hardware counts? And that's before even talking about pipes, caches, or virtual memory page faults which can translate into disk access.
I could make similar points about network, GPU, monitor and so on. Main point remains, however, that for an assignment it's always better to ask, than to second guess - and possibly guess wrong.
These are correct examples. Some other examples of hardware overheads include waiting for a device (e.g. printer), and waiting for another node on a network.
Software overheads include things like accessing a shared library, or virtual-table dispatch. You'll be hard-pressed to find either in kernel space; I don't think shared objects can exist outside of user land.
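For the time-measurement overhead specifically, a minimal sketch of how one might estimate it (the iteration count is arbitrary, and whether you class the result as software or hardware overhead is again something to ask your instructor):

```cpp
// Sketch: estimate the per-call cost of a time measurement by timing a large
// number of back-to-back clock reads.
#include <chrono>
#include <cstdint>
#include <iostream>

int main() {
    using clock = std::chrono::steady_clock;
    const int iterations = 1000000;
    std::int64_t sink = 0;                  // keeps the loop from being optimised away

    auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        sink += clock::now().time_since_epoch().count();
    auto end = clock::now();

    double totalNs = static_cast<double>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());
    std::cout << "approx. cost per clock read: " << totalNs / iterations
              << " ns (ignore: " << sink % 2 << ")\n";
    return 0;
}
```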
I have an application written in OpenCV. It consists of two threads that each perform an OpenCV function. How can I determine how much memory each thread is using?
I'm using libdispatch, the Grand Central Dispatch design pattern. It is at a stage where I can have multiple tasks running at once. How can I manage memory in such a situation? With some OpenCV processing and enough concurrent tasks, I can easily hit my RAM ceiling. How do I manage this?
What strategies can be employed in C++?
If each thread had a memory limit, how could this be handled?
Regards,
Daniel
I'm not familiar with the dispatching library/pattern you're using, but I've had a quick glance over what it aims to do. I've done a fair amount of work in the image processing/video processing domain, so hopefully my answer isn't a completely useless wall-of-text ;)
My suspicion is that you're firing off whole image buffers to different threads to run the same processing on them. If this is the case, then you're quickly going to hit RAM limits. If a task (thread) uses N image buffers in its internal functions, and your RAM is M, then you may start running out of legs at M / N tasks (threads). If this is the case, then you may need to resort to firing off chunks of images to the threads instead (see the hints further down on using dependency graphs for processing).
You should also consider the possibility that performance in your particular algorithm is memory bound and not CPU bound. So it may be pointless firing off more threads even though you have extra cores, and perhaps in this case you're better off focusing on CPU SIMD things like SSE/MMX.
Profile first, Ask (Memory Allocator) Questions Later
Using hand-rolled memory allocators that cater for concurrent environments and your specific memory requirements can make a big difference to performance. However, they're unlikely to reduce the amount of memory you use unless you're working with many small objects, where you may be able to do a better job with memory layout when allocating and reclaiming them than the default malloc/free implementations. As you're working with image processing algorithms, the latter is unlikely. You've typically got huge image buffers allocated on the heap as opposed to many small-ish structs.
I'll add a few tips on where to begin reading on rolling your own allocators at the end of my answer, but in general my advice would be to first profile and figure out where the memory is being used. Having written the code you may have a good hunch about where it's going already, but if not tools like valgrind's massif (complicated beast) can be a big help.
After having profiled the code, figure out how you can reduce the memory use. There are many, many things you can do here, depending on what's using the memory. For example:
Free up any memory you don't need as soon as you're done with it. RAII can come in handy here (see the short sketch after this list).
Don't copy memory unless you need to.
Share memory between threads and processes where appropriate. It will be more work than using immutable/copied data, because you'll have to synchronise read/write access, but depending on your problem case it may make a big difference.
If you're using memory caches, and you don't want to cache the data to disk for performance reasons, then consider using in-memory compression (e.g. zipping some of the cache) when entries fall to the bottom of your least-recently-used cache.
Instead of loading a whole dataset and having each method operate on the whole of it, see if you can chunk it up and only operate on a subset of it. This is particularly relevant when dealing with large data sets.
See if you can get away with using less resolution or accuracy, e.g. quarter-size instead of full size images, or 32 bit floats instead of 64 bit floats (or even custom libraries for 16 bit floats), or perhaps using only one channel of image data at a time (just red, or just blue, or just green, or greyscale instead of RGB).
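To make the first item above concrete, here is a minimal RAII sketch (the buffer type and the downsample helper are purely illustrative): tie a temporary working buffer's lifetime to the smallest scope that needs it, so its memory is returned before later stages run.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: keep every other sample (stands in for real downsampling).
static std::vector<float> downsample(const std::vector<float>& in) {
    std::vector<float> out;
    out.reserve(in.size() / 2);
    for (std::size_t i = 0; i < in.size(); i += 2)
        out.push_back(in[i]);
    return out;
}

std::vector<float> process(const std::vector<float>& image) {
    std::vector<float> result;
    {
        // Temporary working copy, needed only for this intermediate step.
        std::vector<float> scratch(image);
        for (float& v : scratch) v *= 0.5f;   // some in-place work
        result = downsample(scratch);
    }   // 'scratch' is destroyed here, so its memory is released before later stages
    return result;
}

int main() {
    std::vector<float> img(1 << 20, 1.0f);    // ~1M samples, stand-in for an image
    return process(img).empty() ? 1 : 0;
}
```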
As you're working with OpenCV, I'm guessing you're either working on image processing or video processing. These can easily gobble up masses of memory. In my experience, initial R&D implementations typically process a whole image buffer in one method before passing it over to the next. This often results in multiple full image buffers being used, which is hugely expensive in terms of memory consumption. Reducing the use of any temporary buffers can be a big win here.
Another approach to alleviate this is to see if you can figure out the data dependencies (by looking at the ROIs required for low-pass filters, for example), then process smaller chunks of the images and join them up again later, avoiding temporary duplicate buffers as much as possible. Reducing the memory footprint in this way can be a big win, as you're also typically reducing the chance of a cache miss. Such approaches often hugely complicate the implementation, and unless you have a graph-based framework in place that already supports it, it's probably not something you should attempt before exhausting other options. Intel have a number of great resources pertaining to optimisation of threaded image processing applications.
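As a rough sketch of that chunking idea, assuming OpenCV is available and the low-pass filter is a simple Gaussian blur (strip height and kernel size are illustrative; a real pipeline's dependency analysis may be more involved):

```cpp
// Sketch: run a low-pass filter strip by strip instead of allocating a second
// full-size temporary for the whole image. Each strip is padded by the filter
// radius so the results join up seamlessly.
#include <opencv2/opencv.hpp>
#include <algorithm>

void blurInStrips(const cv::Mat& src, cv::Mat& dst,
                  int stripHeight = 128, int kernel = 5) {
    dst.create(src.size(), src.type());
    const int pad = kernel / 2;                        // extra rows needed by the filter

    for (int y = 0; y < src.rows; y += stripHeight) {
        int top    = std::max(0, y - pad);
        int bottom = std::min(src.rows, y + stripHeight + pad);

        cv::Mat srcStrip = src(cv::Rect(0, top, src.cols, bottom - top));
        cv::Mat blurred;
        cv::GaussianBlur(srcStrip, blurred, cv::Size(kernel, kernel), 0);

        // Copy only the un-padded part of the result into the output.
        int h = std::min(stripHeight, src.rows - y);
        blurred(cv::Rect(0, y - top, src.cols, h))
            .copyTo(dst(cv::Rect(0, y, src.cols, h)));
    }
}

int main() {
    cv::Mat img(1000, 1000, CV_8UC1, cv::Scalar(128));
    cv::Mat out;
    blurInStrips(img, out);
    return out.empty() ? 1 : 0;
}
```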
Tips on Memory Allocators
If you still think playing with memory allocators is going to be useful, here are some tips.
For example, on Linux, you could use
malloc hooks, or
just override them in your main compilation unit (main.cpp), in a library that you statically link, or in a shared library that you LD_PRELOAD, for example.
There are several excellent malloc/free replacements available that you could study for ideas, e.g.
dlmalloc
tcmalloc
If you're dealing with specific C++ objects, then you can override their new and delete operators. See this link, for example.
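A minimal sketch of class-specific operator new/delete (the 'pool' below is just counted malloc/free; a real allocator would hand out blocks from pre-allocated, thread-aware pools):

```cpp
#include <cstdlib>
#include <iostream>
#include <new>

class Node {
public:
    static void* operator new(std::size_t size) {
        ++liveAllocations;                     // place to hook in pool bookkeeping
        if (void* p = std::malloc(size)) return p;
        throw std::bad_alloc();
    }
    static void operator delete(void* p) noexcept {
        --liveAllocations;
        std::free(p);
    }
    static int liveAllocations;

    int value = 0;
    Node* next = nullptr;
};

int Node::liveAllocations = 0;

int main() {
    Node* n = new Node();                      // routed through Node::operator new
    std::cout << "live: " << Node::liveAllocations << '\n';   // prints 1
    delete n;                                  // routed through Node::operator delete
    std::cout << "live: " << Node::liveAllocations << '\n';   // prints 0
    return 0;
}
```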
Lastly, if I did manage to guess wrong about where memory is being used, and you do in fact have loads of small objects, then search the web for 'small memory allocators'. Andrei Alexandrescu wrote a couple of great articles on this, e.g. here and here.
Problem: A large-scale simulation game has a ridiculous number of distinct objects which have to be tracked, updated, and used for both visual rendering and for logical model updating. With only 4 GB of address space, you can only fit so many things into memory. If you resort to disk, things start to slow down unless you get lucky and are constantly hitting the page cache. But even then, making a lot of updates/writes is going to be costly when the filesystem syncs to disk.
Let's assume the user has at least 32GB of RAM (a few have reported 64) and wants to play an enormous simulation, causing the model to carry around an order of magnitude more data than most of the stuff the game has been developed to handle. They of course have a 64-bit OS (let's say, Windows 7 x64 or Windows 8 x64). Naturally, if you just store all of this model data in virtual address space in-process, even with Large Address Aware you're going to run into an out of memory situation, even if the host has gigabytes and gigabytes of free RAM (because the 32-bit process is out of Virtual Address Space (VAS)).
Let's also assume that, for reasons entirely out of your control, you can't make the main binary 64-bit. You depend on some proprietary framework, have spent ridiculous man hours coding to that framework, and would have to start over from square one to move to something else. Your framework only ships a 32-bit version, so you're stuck.
Right?
Maybe.
I had a random thought, and it seems like a long shot, because I don't know if I could make it efficient or practical.
If I could create a child 64-bit process, it would be able to use, for all practical purposes, as much RAM as anyone today can afford to buy and slot into a motherboard, even on a very high-end server chassis.
Now, I want to be able to efficiently store and retrieve large chunks of data from the model that's been shoved into the child process, and to get subsections of that data copied out of it from time to time. So basically, the "main" model in all its gigabytes of glory (a series of very large tree-ish and hashtable-ish structures) will sit in the 64-bit process, and the 32-bit process will peek in there, grab a hunk of data, do stuff with it (or maybe I should ask the child process to do some processing in there to distill it down?), then get rid of it -- to keep the memory usage manageable in the 32-bit process.
All the model read/mutate/simulate algorithms are based around the assumption that the model is available locally in-process, so things like random array access are common. It would be hard for me to distill down my access patterns to a few chunk-based sequential reads from the main model, and a walk of the whole model isn't terribly uncommon either.
My goals are:
Keep the darn thing from crashing due to out of memory (#1 goal)
Performance (a very close #2, but people using an extreme amount of complexity may come to accept worse performance than those simulating a smaller, simpler game)
Minimal refactoring of the existing code (more or less vanilla C++ with rendering calls and multithreading)
It seems like a pretty hefty project to even undertake, since going from a coherent memory model to having to essentially look through an aperture at a much larger model than I can grab at any one time will probably necessitate a lot of algorithm redesign.
My questions:
Is there any precedent for doing this?
How would this best be done on Windows? Is there some kind of shared memory like on Linux, or lightweight very high-bandwidth random memory access IPC that can be integrated into C++ via something like an operator[]() implementation?
Is the performance of any IPC going to be so bad that it's not even worth trying? Should I just rely on the disk (you know, databases or key-value or whatever) and let the operating system / filesystem figure out how to use the RAM?
Keep in mind that I'd need support for an extremely "chatty" IPC mechanism, because a lot of the processing algorithms (AI, etc) are designed around small memory accesses and updates. This works well enough in-process, and there is even some attention towards cache locality, but all that turns weird when you are accessing it over IPC.
I was in a similar situation: the GUI was 32-bit but needed x64 code to interface with a system. The approach we took was to use WM_COPYDATA and pass data back and forth across the magical process bit boundary. Naturally, it is not as fast as using a DLL, but that was not an option. The performance tradeoff was acceptable for our use case.
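A hedged sketch of what the sending side can look like (the window title, message tag and payload layout are all placeholders):

```cpp
// Sketch: the 32-bit sender passes a block of bytes to the 64-bit helper's
// message window via WM_COPYDATA.
#include <windows.h>
#include <vector>

bool sendBlock(const std::vector<char>& payload) {
    // Find the helper process's window by its (hypothetical) title.
    HWND target = FindWindowA(nullptr, "MyX64HelperWindow");
    if (!target) return false;

    COPYDATASTRUCT cds{};
    cds.dwData = 1;                                    // application-defined message tag
    cds.cbData = static_cast<DWORD>(payload.size());
    cds.lpData = const_cast<char*>(payload.data());

    // SendMessage blocks until the receiver's window procedure has handled the
    // message, so the payload buffer only has to stay alive for this call.
    return SendMessageA(target, WM_COPYDATA, 0,
                        reinterpret_cast<LPARAM>(&cds)) != 0;
}

int main() {
    std::vector<char> block(1024, 'x');                // placeholder payload
    return sendBlock(block) ? 0 : 1;
}
```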
If I understand you correctly:
If you use multiple processes, then Windows will still need to page the sections in and out.
This is how I'd envisage trying it.
Use memory-mapped files to map a view/section of the disk-persisted memory space that you need. You will of course need some internal mapping scheme too.
The key thing that I don't know at this point is whether you can access the 64-bit API from 32-bit code.
Windows will take care of the paging magically and efficiently. This is what it does for memory mapping virtual memory anyway. We used to use it to work with massive datasets on the early 32 bit NT systems and the technology just works.
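To make that concrete, here is a rough sketch of a named, pagefile-backed section that both processes open by name: the 64-bit helper creates it, and the 32-bit process maps only a small window of it at a time. The section name and sizes are placeholders, the window offset must be a multiple of the allocation granularity, and the default SEC_COMMIT behaviour charges the full size against the system commit limit.

```cpp
#include <windows.h>
#include <cstddef>
#include <cstdint>

// 64-bit side: create a 16 GB pagefile-backed section (size is illustrative).
HANDLE createSection() {
    const std::uint64_t size = 16ull * 1024 * 1024 * 1024;
    return CreateFileMappingA(INVALID_HANDLE_VALUE, nullptr, PAGE_READWRITE,
                              static_cast<DWORD>(size >> 32),
                              static_cast<DWORD>(size & 0xFFFFFFFF),
                              "Local\\SimModelSection");          // placeholder name
}

// 32-bit side: open the same section by name and map a 64 MB window at 'offset'.
void* mapWindow(std::uint64_t offset, std::size_t windowBytes = 64u * 1024 * 1024) {
    HANDLE section = OpenFileMappingA(FILE_MAP_READ | FILE_MAP_WRITE, FALSE,
                                      "Local\\SimModelSection");
    if (!section) return nullptr;
    void* view = MapViewOfFile(section, FILE_MAP_READ | FILE_MAP_WRITE,
                               static_cast<DWORD>(offset >> 32),
                               static_cast<DWORD>(offset & 0xFFFFFFFF),
                               windowBytes);
    CloseHandle(section);    // the mapped view keeps the section alive until unmapped
    return view;
}

int main() {
    // Smoke test in a single process; in the real design createSection() runs in
    // the 64-bit helper and mapWindow()/UnmapViewOfFile() in the 32-bit game.
    HANDLE section = createSection();
    if (!section) return 1;
    void* view = mapWindow(0);
    if (view) UnmapViewOfFile(view);
    CloseHandle(section);
    return view ? 0 : 1;
}
```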
Is there a way to determine exactly what values, memory addresses, and/or other information currently resides in the CPU cache (L1, L2, etc.) - for current or all processes?
I've been doing quite a bit of reading which shows how to optimize programs to utilize the CPU cache more effectively. However, I'm looking for a way to truly determine whether certain approaches are effective.
Bottom line: is it possible to be 100% certain of what does and does not make it into the CPU cache?
Searching for this topic returns several results on how to determine the cache size, but not the contents.
Edit: To clarify some of the comments below: since software would undoubtedly alter the cache, do CPU manufacturers have a tool / hardware diagnostic system (built in) which provides this functionality?
Without using specialized hardware, you cannot directly inspect what is in the CPU cache. The act of running any software to inspect the CPU cache would alter the state of the cache.
The best approach I have found is simply to identify real hot spots in your application and benchmark alternative algorithms on hardware the code will run on in production (or on a range of likely hardware if you do not have control over the production environment).
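A minimal benchmarking harness along those lines (the two variants below are placeholders for your real candidate algorithms):

```cpp
// Sketch: time two candidate implementations on representative data and compare.
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

static long long sumForward(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

static long long sumBackward(const std::vector<int>& v) {
    return std::accumulate(v.rbegin(), v.rend(), 0LL);
}

static volatile long long sink = 0;            // defeats over-eager optimisation

template <typename F>
static double millisecondsPerCall(F&& f, int repeats = 20) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) sink += f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count() / repeats;
}

int main() {
    std::vector<int> data(10 * 1000 * 1000, 1);    // representative input size
    double forward  = millisecondsPerCall([&] { return sumForward(data); });
    double backward = millisecondsPerCall([&] { return sumBackward(data); });
    std::cout << "forward:  " << forward  << " ms\n"
              << "backward: " << backward << " ms\n";
    return 0;
}
```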
In addition to Eric J.'s answer, I'll add that while I'm sure the big chip manufacturers do have such tools, it's unlikely that such a "debug" facility would be made available to regular mortals like you and me; and even if it were, it wouldn't really be of much help.
Why? It's unlikely that you are having performance issues that you've traced to cache and which cannot be solved using the well-known and "common sense" techniques for maintaining high cache-hit ratios.
Have you really optimized all other hotspots in the code and poor cache behavior by the CPU is the problem? I very much doubt that.
Additionally, as food for thought: do you really want to optimize your program's behavior to only one or two particular CPUs? After all, caching algorithms change all the time, as do the parameters of the caches, sometimes dramatically.
If you have a relatively modern processor running Windows then take a look at
http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
and see if that might provide some of what you are looking for.
Optimizing for one specific CPU cache size is usually in vain, since the optimization breaks as soon as your assumptions about the cache sizes turn out to be wrong on a different CPU.
But there is a way out. You should optimize for certain access patterns that allow the CPU to easily predict which memory locations will be read next (the most obvious one is a linearly increasing read). To fully utilize a CPU you should also read about cache-oblivious algorithms, most of which follow a divide-and-conquer strategy in which a problem is divided into subparts until all memory accesses fit completely into the CPU cache.
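A tiny illustration of the access-pattern point: summing a matrix row by row walks memory linearly, while summing it column by column strides through it, and on most hardware the linear version is markedly faster even though the arithmetic is identical (sizes below are arbitrary):

```cpp
// Sketch: linear vs. strided traversal of the same 64 MB matrix.
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    const std::size_t n = 4096;                      // 4096 x 4096 ints = 64 MB
    std::vector<int> m(n * n, 1);

    auto time = [](const char* label, auto&& body) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = body();
        auto t1 = std::chrono::steady_clock::now();
        std::cout << label << ": "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count()
                  << " ms (sum " << s << ")\n";
    };

    time("row-major (linear)", [&] {                 // i*n + j walks memory in order
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) s += m[i * n + j];
        return s;
    });

    time("column-major (strided)", [&] {             // jumps n ints on every step
        long long s = 0;
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t i = 0; i < n; ++i) s += m[i * n + j];
        return s;
    });
    return 0;
}
```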
It is also worth mentioning that the code and data caches are separate. Herb Sutter has a nice video online where he talks about CPU internals in depth.
The Visual Studio Profiler can collect CPU counters dealing with memory and the L2 cache. These options are available when you select instrumentation profiling.
Intel also has a paper online which talks in greater detail about these CPU counters, about what the task managers of Windows and Linux show you, and about how misleading that is for today's CPUs, which work internally asynchronously and in parallel at many different levels. Unfortunately there is no tool from Intel to display this stuff directly. The only tool I know of is the VS profiler. Perhaps VTune has similar capabilities.
If you have gone this far to optimize your code, you might as well look into GPU programming. You practically need a PhD to get your head around SIMD instructions, cache locality and the like to gain perhaps a factor of 5 over your original design, but by porting your algorithm to a GPU you can get a factor of 100 with much less effort on a decent graphics card. NVIDIA GPUs that support CUDA (all cards sold today do) can be programmed very nicely in a C dialect. There are even wrappers for managed code (.NET) to take advantage of the full power of GPUs.
You can stay platform agnostic by using OpenCL, but NVIDIA's OpenCL support is very poor. The OpenCL drivers are at least 8 times slower than their CUDA counterparts.
Almost everything you do will be in the cache at the moment when you use it, unless you are reading memory that has been configured as "uncacheable" - typically, that's frame buffer memory of your graphics card. The other way to "not hit the cache" is to use specific load and store instructions that are "non-temporal". Everything else is read into the L1 cache before it reaches the target registers inside the CPU itself.
In nearly all cases, CPUs have a fairly good system for deciding what to keep and what to throw away in the cache, and the cache is nearly always "full" - not necessarily of useful stuff. If, for example, you are working your way through an enormous array, the cache will just contain a lot of "old array" [this is where the "non-temporal" memory operations come in handy, as they allow you to read and/or write data that won't be stored in the cache - since by the time you get back to the same point, it won't be in the cache anyway].
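For completeness, a hedged sketch of those non-temporal stores using SSE2 intrinsics (the buffer size is arbitrary, and the destination must be 16-byte aligned):

```cpp
// Sketch: fill a large buffer with streaming (non-temporal) stores so the
// written data bypasses the cache instead of evicting lines you still need.
#include <xmmintrin.h>   // _mm_malloc, _mm_free, _mm_sfence
#include <emmintrin.h>   // SSE2: _mm_set1_epi32, _mm_stream_si128
#include <cstddef>
#include <cstdint>

void fillStreaming(std::int32_t* dst, std::size_t count, std::int32_t value) {
    __m128i v = _mm_set1_epi32(value);
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)                    // dst must be 16-byte aligned
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    _mm_sfence();                                     // order the streaming stores
    for (; i < count; ++i) dst[i] = value;            // scalar tail
}

int main() {
    const std::size_t count = std::size_t(1) << 24;                // 64 MB of ints
    void* raw = _mm_malloc(count * sizeof(std::int32_t), 16);      // aligned buffer
    if (!raw) return 1;
    fillStreaming(static_cast<std::int32_t*>(raw), count, 42);
    _mm_free(raw);
    return 0;
}
```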
And yes, processors usually have special registers [that can be accessed in kernel drivers] that can inspect the contents of the cache. But they are quite tricky to use without at the same time losing the content of the cache(s). And they are definitely not useful as "how much of array A is in the cache" type checking. They are specifically for "Hmm, it looks like cache-line 1234 is broken, I'd better read the cached data to see if it's really the value it should be" when processors aren't working as they should.
As DanS says, there are performance counters that you can read from suitable software [you need to be in the kernel to use those registers too, so you need some sort of "driver" software for that]. On Linux, there's "perf". AMD has a similar set of performance counters that can be used to find out, for example, "how many cache misses have we had over this period of time" or "how many cache hits in L... have we had", etc.
I was wondering whether it is possible to distribute an executable program across several computers without modifying its source code - running any game across several machines, for example. When I was programming in C#, I noticed the Process class, which lets you start or close any application or process. I was wondering whether there is something similar in C++ that would let me transfer the processes of any executable file or game to other computers or servers, minimizing my own computer's processor load.
Thanks.
Everything is possible, but this would require a huge amount of work and would almost certainly make your program painfully slower (I'm talking about a factor of millions or billions here). Essentially you would need to make sure that every layer the program uses allows this, so you'd have to rewrite the OS to support it, and also quite a few of the libraries it uses.
Why? Let's assume you want to distribute actual threads over different machines. It would be slightly easier if they were actual processes, but I'd be surprised if many applications worked like that.
To begin with, you need to synchronize the memory - more specifically, all non-thread-local storage, which often means 'all memory', because not every language has a thread-aware memory model. Of course this can be optimized, for example by buffering everything until you encounter an 'atomic' read or write, if your system has such a concept at all. Now, can you imagine every thread blocking for a few seconds of synchronization whenever a lock has to be taken or released or an atomic variable has to be read or written?
Next to that there are the issues of managing devices. Assume you need a network connection: which machine's device will open it, and how will the IP address be chosen, ...? To solve this seamlessly you would probably need a virtual device shared among all platforms. This would have to happen for network devices, filesystems, printers, monitors, ... . And since you mention games: it would have to happen for the GPU as well; just imagine how this would impact performance merely for sending data to and from the GPU (hint: even 16x PCIe is often already a bottleneck).
In conclusion: this is not feasible. If you want a clustered application, you have to build it into the application from scratch.
I believe the closest thing you can do is MapReduce: it's a paradigm which will hopefully become part of the official Boost library soon. However, I don't think you would want to apply it to a real-time application like a game.
A related question may provide more answers: https://stackoverflow.com/questions/2168558/is-there-anything-like-hadoop-in-c
But as KillianDS pointed out, there is no automagical way to do this, nor does it seem like is there a feasible way to do it. So what is the exact problem that you're trying to solve?
Current research focuses on practical means of distributing the work of a process across multiple CPU cores on a single computer. In that case the processors still share RAM. This is essential: RAM latencies are measured in nanoseconds.
In distributed computing, remote memory access can take tens if not hundreds of microseconds. Distributed algorithms explicitly take this into account. No amount of magic can make this disappear: light itself is slow.
The Plan 9 OS from AT&T Bell Labs supports distributed computing in the most seamless and transparent manner. Plan 9 was designed to take the Unix ideas of breaking jobs into small interoperating tasks performed by highly specialised utilities, of "everything is a file", and of the client/server model to a whole new level. It has the idea of a CPU server which performs computations for less powerful networked clients. Unfortunately the idea was too ambitious and way ahead of its time, and Plan 9 remained largely a research project. It is still being developed as open source software though.
MOSIX is another distributed OS project that provides a single process space over multiple machines and supports transparent process migration. It allows processes to become migratable without any changes to their source code as all context saving and restoration are done by the OS kernel. There are several implementations of the MOSIX model - MOSIX2, openMosix (discontinued since 2008) and LinuxPMI (continuation of the openMosix project).
ScaleMP is yet another commercial Single System Image (SSI) implementation, mainly targeted at data processing and High-Performance Computing. It not only provides transparent migration between the nodes of a cluster but also provides emulated shared memory (known as Distributed Shared Memory). Basically it transforms a bunch of computers, connected via a very fast network, into a single big NUMA machine with many CPUs and a huge amount of memory.
None of these would allow you to launch a game on your PC and have it transparently migrated and executed somewhere on the network. Besides, most games are GPU-intensive rather than CPU-intensive - most games still don't even utilise the full computing power of multicore CPUs. We have a ScaleMP cluster here and it doesn't run Quake very well...