C++ Memory Counting in OpenCV

I have an application written in OpenCV. It consists of two threads that each perform an OpenCV function. How can I determine how much memory each thread is using?
I'm using libdispatch, the Grand Central Dispatch design pattern. It is at a stage where I can have multiple tasks running at once. How can I manage memory in such a situation? With some OpenCV processes and enough concurrent tasks, I can easily hit my RAM ceiling. How do I manage this?
What strategies can be employed in C++?
If each thread had a memory limit, how could this be handled?
Regards,
Daniel

I'm not familiar with the dispatching library/pattern you're using, but I've had a quick glance over what it aims to do. I've done a fair amount of work in the image processing/video processing domain, so hopefully my answer isn't a completely useless wall-of-text ;)
My suspicion is that you're firing off whole image buffers to different threads to run the same processing on them. If this is the case, then you're quickly going to hit RAM limits. If a task (thread) uses N image buffers in its internal functions, and your RAM is M, then you may start running out of legs at M / N tasks (threads). If this is the case, then you may need to resort to firing off chunks of images to the threads instead (see the hints further down on using dependency graphs for processing).
You should also consider the possibility that performance in your particular algorithm is memory bound and not CPU bound. So it may be pointless firing off more threads even though you have extra cores, and perhaps in this case you're better off focusing on CPU SIMD things like SSE/MMX.
Profile first, Ask (Memory Allocator) Questions Later
Using hand-rolled memory allocators that cater for concurrent environments and your specific memory requirements can make a big difference to performance. However, they're unlikely to reduce the amount of memory you use unless you're working with many small objects, where you may be able to do a better job with memory layout when allocating and reclaiming them than the default malloc/free implementations. As you're working with image processing algorithms, the latter is unlikely. You've typically got huge image buffers allocated on the heap as opposed to many small-ish structs.
I'll add a few tips on where to begin reading on rolling your own allocators at the end of my answer, but in general my advice would be to first profile and figure out where the memory is being used. Having written the code you may have a good hunch about where it's going already, but if not tools like valgrind's massif (complicated beast) can be a big help.
After having profiled the code, figure out how you can reduce the memory use. There are many, many things you can do here, depending on what's using the memory. For example:
Free up any memory you don't need as soon as you're done with it. RAII can come in handy here (see the sketch after this list).
Don't copy memory unless you need to.
Share memory between threads and processes where appropriate. This is more work than using immutable/copied data, because you'll have to synchronise read/write access, but depending on your problem case it may make a big difference.
If you're using memory caches, and you don't want to push cached data to disk for performance reasons, then consider in-memory compression (e.g. zipping some of the cache) for entries falling to the bottom of your least-recently-used cache.
Instead of loading a whole dataset and having each method operate on the whole of it, see if you can chunk it up and only operate on a subset of it. This is particularly relevant when dealing with large data sets.
See if you can get away with using less resolution or accuracy, e.g. quarter-size instead of full size images, or 32 bit floats instead of 64 bit floats (or even custom libraries for 16 bit floats), or perhaps using only one channel of image data at a time (just red, or just blue, or just green, or greyscale instead of RGB).
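For example, here is a minimal, hedged sketch of the RAII point above - the particular OpenCV filters are just placeholders for whatever your pipeline does; the idea is simply that scoping a temporary releases it before the next stage runs:
#include <opencv2/opencv.hpp>

void processFrame(const cv::Mat &input, cv::Mat &output)
{
    cv::Mat edges;
    {
        cv::Mat blurred;                                   // only needed to compute edges
        cv::GaussianBlur(input, blurred, cv::Size(5, 5), 1.5);
        cv::Canny(blurred, edges, 50, 150);
    }                                                      // blurred is released here, before later stages run
    output = edges > 0;                                    // later stages never see the intermediate buffer
}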
As you're working with OpenCV, I'm guessing you're either working on image processing or video processing. These can easily gobble up masses of memory. In my experience, initial R&D implementations typically process a whole image buffer in one method before passing it over to the next. This often results in multiple full image buffers being used, which is hugely expensive in terms of memory consumption. Reducing the use of any temporary buffers can be a big win here.
Another approach to alleviate this is to see if you can figure out the data dependencies (e.g. the ROIs required by low-pass filters), then process smaller chunks of the images and join them up again later, avoiding temporary duplicate buffers as much as possible. Reducing the memory footprint in this way can be a big win, as you're also typically reducing the chance of cache misses. Such approaches often hugely complicate the implementation, and unless you have a graph-based framework in place that already supports it, it's probably not something you should attempt before exhausting other options. Intel have a number of great resources on optimising threaded image processing applications.
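As a rough illustration of that chunking idea (the strip height and halo size here are arbitrary, and GaussianBlur just stands in for whatever low-pass filter you use), you can process an image in horizontal strips, giving each strip the extra rows of context the filter needs:
#include <opencv2/opencv.hpp>
#include <algorithm>

// Process 'src' in horizontal strips of 'stripRows' rows, padding each strip
// by 'halo' rows so a filter with that radius sees valid neighbours.
void blurInStrips(const cv::Mat &src, cv::Mat &dst, int stripRows = 256, int halo = 2)
{
    dst.create(src.size(), src.type());
    for (int y = 0; y < src.rows; y += stripRows)
    {
        int top    = std::max(0, y - halo);
        int bottom = std::min(src.rows, y + stripRows + halo);
        cv::Mat inStrip = src(cv::Range(top, bottom), cv::Range::all());  // a view, no copy
        cv::Mat outStrip;
        cv::GaussianBlur(inStrip, outStrip, cv::Size(5, 5), 1.5);
        // Copy back only the rows this strip owns (drop the halo).
        int ownTop  = y - top;
        int ownRows = std::min(stripRows, src.rows - y);
        outStrip(cv::Range(ownTop, ownTop + ownRows), cv::Range::all())
            .copyTo(dst(cv::Range(y, y + ownRows), cv::Range::all()));
    }
}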
Tips on Memory Allocators
If you still think playing with memory allocators is going to be useful, here are some tips.
For example, on Linux, you could use
malloc hooks, or
just override malloc/free in your main compilation unit (main.cpp), in a library that you statically link, or in a shared library that you LD_PRELOAD.
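As a rough sketch of the "override in your own code" route - and one way to approach your original per-thread question - you could replace the global operator new/delete and keep a thread_local byte counter. Note this is only a sketch: it counts bytes requested via C++ new on the calling thread, it doesn't track live memory or frees, the array forms are omitted, and it won't see OpenCV's internal malloc-based allocations.
#include <cstdlib>
#include <cstddef>
#include <new>

thread_local std::size_t g_bytesAllocatedOnThisThread = 0;

void *operator new(std::size_t size)
{
    g_bytesAllocatedOnThisThread += size;   // crude: counts requests, not live memory
    if (void *p = std::malloc(size))
        return p;
    throw std::bad_alloc();
}

void operator delete(void *p) noexcept
{
    std::free(p);   // the size isn't known here, so this sketch only tracks totals requested
}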
There are several excellent malloc/free replacements available that you could study for ideas, e.g.
dlmalloc
tcmalloc
If you're dealing with specific C++ objects, then you can override their new and delete operators. See this link, for example.
Lastly, if I did manage to guess wrong regarding where memory is being used, and you do, in fact, have loads of small objects, then search the web for 'small memory allocators'. Andrei Alexandrescu wrote a couple of great articles on this, e.g. here and here.

Related

Implementing a memory manager in multithreaded C/C++ with dynamically sized memory pool?

Background: I'm developing a multiplatform framework of sorts that will be used as a base for both game and util/tool creation. The basic idea is to have a pool of workers, each executing in its own thread. (Furthermore, workers can also be spawned at runtime.) Each thread will have its own memory manager.
I have long thought about creating my own memory management system, and I think this project will be perfect to finally give it a try. I find such a system fitting because the ways this framework will be used often require real-time memory allocation (games and texture editing tools).
Problems:
No generally applicable solution(?) - The framework will be used for both games/visualization (not AAA, but indie/play) and tool/application creation. My understanding is that for game development it is usual (at least for console games) to allocate a big chunk of memory only once in the initialization, and then use this memory internally in the memory manager. But is this technique applicable in a more general application?
In a game you could theoretically know how much memory your scenes and resources will need, but for example, a photo editing application will load resources of all different sizes... So in the latter case a more dynamic memory "chunk size" would be needed? Which leads me to the next problem:
Moving already allocated data and keeping valid pointers - Normally when allocating on the heap, you acquire a simple pointer to the memory chunk. In a custom memory manager, as far as I understand it, a similar approach is to return a pointer to somewhere free in the pre-allocated chunk. But what happens if the pre-allocated chunk is too small and needs to be resized or even defragmented? The data would need to be moved around in memory and the old pointers would become invalid. Is there a way to transparently wrap these pointers in some way, but still use them outside the memory manager as if they were usual C++ pointers?
Third party libraries - If there is no way to transparently use a custom memory management system for all memory allocation in the application, every third party library I link with will still use the "old" OS memory allocations internally. I have learned that it is common for libraries to expose functions for setting custom allocation functions that the library will use, but it is not guaranteed that every library I use will have this ability.
Questions: Is it possible and feasible to implement a memory manager that can use a dynamically sized memory chunk pool? If so, how would defragmentation and memory resize work, without breaking currently in-use pointers? And finally, how is such a system best implemented to work with third party libraries?
I'm also thankful for any related reading material, papers, articles and whatnot! :-)
As someone who has written many memory managers and heap implementations for AAA games over the last few generations of consoles, let me tell you it's simply not worth it anymore.
Your information is old - back in the GameCube era [circa 2003] we used to do what you said: allocate a large chunk and carve out that chunk manually using custom algorithms tweaked for each game.
Once virtual memory came along (the Xbox era), games got more complicated [and so made more allocations and became multithreaded], and address fragmentation made this untenable. So we switched to custom allocators that handle only certain types of requests - for instance physical memory, or lock-free small-block low-fragmentation heaps, or thread-local caches of recently used blocks.
As built-in memory managers get better, it becomes harder to beat them - certainly in the general case, and it's a close thing even for specific use cases. The Doug Lea allocator [or whatever the mainstream C++ Linux compilers ship with now] and the latest Windows low-fragmentation heaps are really very good, and you'd do far better investing your time elsewhere.
I've got spreadsheets at work measuring all kinds of metrics for a whole load of allocators - all the big-name ones and a fair few I've collected over the years. And basically, whilst the specialist allocators can win on a few metrics [lowest overhead per alloc, spatial proximity, lowest fragmentation, etc.], for overall metrics the mainstream ones are simply the best.
As a user of your library, my personal preferred option is you just allocate memory when you need it. Use operator new/the new operator and I can use the standard C++ mechanisms to replace those and use my custom heap (if I indeed have one), or alternatively I can use platform specific ways of replacing your allocations (e.g. XMemAlloc on Xbox). I don't need tagging [capturing callstacks is far superior which I can do if I want]. Lower down that list comes you giving me an interface that you'll call when you need to allocate memory - this is just a pain for you to implement and I'll probably just pass it onto operator new anyway. The worst thing you can do is 'know best' and create your own custom heaps. If memory allocation performance is a problem, I'd much rather you share the solution the whole game uses than roll your own.
If you're looking to write your own malloc()/free(), etc., you probably should start by checking out the source code for existing systems such as dlmalloc. This is a hard problem, though, for what it's worth. Writing your own malloc library is Hard. Beating existing general purpose malloc libraries will be Even Harder.
And now, here is the correct answer: DON'T IMPLEMENT YET ANOTHER MEMORY MANAGER.
It is incredibly hard to implement a memory manager that does not fail under different kinds of usage patterns and events. You may be able to build a specific manager that works well under YOUR usage patterns, but to write one which works well for MANY users is a full-time job that almost no one has really done well. Worse, it is fantastically easy to implement a memory manager that works great 99% of the time and then 1% of the time crash or suddenly consume most or all available memory on your system due to unexpected heap fragmentation.
I say this as someone who has written multiple memory managers, watched multiple people write their own memory managers, and watched even more people attempt to write memory managers and fail. This problem is deceptively difficult, not because it's hard to write templated allocators and generic types with inheritance and such, but because the other solutions given in this thread tend to fail under corner-case load behaviour. Once you start supporting byte alignments (as all real-world allocators must), heap fragmentation rears its ugly head. Cute heuristics that work great for small test programs fail miserably when subjected to large, real-world programs.
And once you get it working, someone else will need: cookies to verify against memory stomps; heap usage reporting; memory pools; pools of pools; memory leak tracking and reporting; heap auditing; chunk splitting and coalescing; thread-local storage; lookasides; CPU and process-level page faulting and protection; setting and checking and clearing "free-memory" patterns aka 0xdeadbeef; and whatever else I can't think of off the top of my head.
Writing yet another memory manager falls squarely under the heading of Premature Optimization. Since there are multiple free, good, memory managers with thousands of hours of development and testing behind them, you have to justify spending the cost of your own time in such a way that the result would provide some sort of measurable improvement over what other people have done, and you can use, for free.
If you are SURE you want to implement your own memory manager (and hopefully you are NOT sure after reading this message), read through the dlmalloc sources in detail, then read through the tcmalloc sources in detail as well, THEN make sure you understand the performance trade-offs in implementing a thread-safe versus a thread-unsafe memory manager, and why the naive implementations tend to give poor performance results.
Prepare more than one solution and let the user of the framework adopt any particular one. Policy classes to the generic allocator you develop would do this nicely.
A nice way to get around this is to wrap up pointers in a class with overloaded * operator. Make the internal data of that class only an index to the memory pool. Now, you can just change the index quickly after a background thread copies the data over.
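A minimal, hedged sketch of that idea follows - the MemoryPool and Handle names are made up for illustration, and locking, compaction, and deallocation are deliberately left out. The point is only that the user-facing "pointer" stores a stable index, so the pool can move the underlying block as long as it updates its table.
#include <cstddef>
#include <vector>

// Hypothetical pool: owns the storage and a table mapping stable indices to
// current addresses. A compaction pass may move blocks, updating the table.
class MemoryPool
{
public:
    std::size_t allocate(std::size_t bytes)
    {
        blocks_.push_back(std::vector<char>(bytes));
        table_.push_back(blocks_.back().data());
        return table_.size() - 1;                      // stable index handed out to handles
    }
    void *addressOf(std::size_t index) const { return table_[index]; }

private:
    std::vector<std::vector<char>> blocks_;            // stand-in for the real contiguous chunk
    std::vector<void *> table_;                        // updated whenever a block moves
};

// A 'relocatable pointer': dereferences through the pool's table every time.
template <typename T>
class Handle
{
public:
    Handle(MemoryPool &pool, std::size_t index) : pool_(&pool), index_(index) {}
    T &operator*()  const { return *static_cast<T *>(pool_->addressOf(index_)); }
    T *operator->() const { return  static_cast<T *>(pool_->addressOf(index_)); }
private:
    MemoryPool *pool_;
    std::size_t index_;
};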
Most good C++ libraries support allocators and you should implement one. You can also overload global operator new so your version gets used. And keep in mind that you generally won't need to worry about a library allocating or deallocating a large amount of data, which is usually the responsibility of client code.
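For the allocator part, a minimal C++11-style allocator skeleton looks roughly like this - here it just forwards to operator new/delete, which is where a custom pool would plug in; the name PoolAllocator is only illustrative:
#include <cstddef>
#include <new>
#include <vector>

template <typename T>
struct PoolAllocator
{
    using value_type = T;

    PoolAllocator() = default;
    template <typename U>
    PoolAllocator(const PoolAllocator<U> &) noexcept {}

    T *allocate(std::size_t n)
    {
        // A real implementation would carve this out of your pool instead.
        return static_cast<T *>(::operator new(n * sizeof(T)));
    }
    void deallocate(T *p, std::size_t) noexcept { ::operator delete(p); }
};

template <typename T, typename U>
bool operator==(const PoolAllocator<T> &, const PoolAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const PoolAllocator<T> &, const PoolAllocator<U> &) { return false; }

// Usage: std::vector<int, PoolAllocator<int>> v; v.push_back(42);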

Out-of-process memory heap to work around 32-bit address space

Problem: A large-scale simulation game has a ridiculous number of distinct objects which have to be tracked, updated, and used for both visual rendering and for logical model updating. With only 4 GB of address space, you can only fit so many things into memory. If you resort to disk, things start to slow down unless you get lucky and are constantly hitting the page cache. But even then, making a lot of updates/writes is going to be costly when the filesystem syncs to disk.
Let's assume the user has at least 32GB of RAM (a few have reported 64) and wants to play an enormous simulation, causing the model to carry around an order of magnitude more data than most of the stuff the game has been developed to handle. They of course have a 64-bit OS (let's say, Windows 7 x64 or Windows 8 x64). Naturally, if you just store all of this model data in virtual address space in-process, even with Large Address Aware you're going to run into an out of memory situation, even if the host has gigabytes and gigabytes of free RAM (because the 32-bit process is out of Virtual Address Space (VAS)).
Let's also assume that, for reasons entirely out of your control, you can't make the main binary 64-bit. You depend on some proprietary framework, have spent ridiculous man hours coding to that framework, and would have to start over from square one to move to something else. Your framework only ships a 32-bit version, so you're stuck.
Right?
Maybe.
I had a random thought, and it seems like a long shot, because I don't know if I could make it efficient or practical.
If I could create a child 64-bit process, it would be able to use, for all practical purposes, as much RAM as anyone today can afford to buy and slot into a motherboard, even on a very high-end server chassis.
Now, I want to be able to efficiently store and retrieve large chunks of data from the model that's been shoved into the child process, and get subsections of that data copied back into the 32-bit process from time to time. So basically, the "main" model in all its gigabytes of glory (a series of very huge tree-ish and hashtable-ish structures) will sit in the 64-bit process, and the 32-bit process will peek in there, grab a hunk of data, do stuff with it (or maybe I should ask the child process to do some processing in there to distill it down?), then get rid of it - to keep the memory usage manageable in the 32-bit process.
All the model read/mutate/simulate algorithms are based around the assumption that the model is available locally in-process, so things like random array access are common. It would be hard for me to distill down my access patterns to a few chunk-based sequential reads from the main model, and a walk of the whole model isn't terribly uncommon either.
My goals are:
Keep the darn thing from crashing due to out of memory (#1 goal)
Performance (a very close #2, but people using an extreme amount of complexity may come to accept worse performance than those simulating a smaller, simpler game)
Minimal refactoring of the existing code (more or less vanilla C++ with rendering calls and multithreading)
It seems like a pretty hefty project to even undertake, since going from a coherent memory model to having to essentially look through an aperture at a much larger model than I can grab at any one time will probably necessitate a lot of algorithm redesign.
My questions:
Is there any precedent for doing this?
How would this best be done on Windows? Is there some kind of shared memory like on Linux, or lightweight very high-bandwidth random memory access IPC that can be integrated into C++ via something like an operator[]() implementation?
Is the performance of any IPC going to be so bad that it's not even worth trying? Should I just rely on the disk (you know, databases or key-value or whatever) and let the operating system / filesystem figure out how to use the RAM?
Keep in mind that I'd need support for an extremely "chatty" IPC mechanism, because a lot of the processing algorithms (AI, etc) are designed around small memory accesses and updates. This works well enough in-process, and there is even some attention towards cache locality, but all that turns weird when you are accessing it over IPC.
I was in a similar situation to yours: the GUI was 32-bit but needed x64 code to interface with a system. The approach we took was to use WM_COPYDATA and pass data back and forth across the magical process bit boundary. Naturally, it is not as fast as using a DLL, but that was not an option. The performance trade-off was acceptable for our use case.
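For reference, the sending side of that route looks roughly like this (sendBlob and the app-defined message id are made up; the window handles and message routing are up to you). The receiver gets the same COPYDATASTRUCT in its window procedure and must copy the payload out before returning.
#include <windows.h>

// Send a blob of model data to the helper process's window (hTargetWnd).
// WM_COPYDATA marshals the buffer across the 32/64-bit process boundary for you.
bool sendBlob(HWND hTargetWnd, HWND hSenderWnd, const void *data, DWORD bytes)
{
    COPYDATASTRUCT cds;
    cds.dwData = 1;                                   // app-defined message id
    cds.cbData = bytes;
    cds.lpData = const_cast<void *>(data);
    return SendMessage(hTargetWnd, WM_COPYDATA,
                       reinterpret_cast<WPARAM>(hSenderWnd),
                       reinterpret_cast<LPARAM>(&cds)) != 0;
}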
If I understand you correctly:
If you use multiple processes then Windows will still need to page the sections in and out.
This is how I'd envisage trying it.
Use memory mapped files to map a view/section of the memory space persisted on disk that you need. You will of course need to have some internal mapping scheme too.
The key thing that I don't know at this point is whether you can access the 64 bit API from 32 bits.
Windows will take care of the paging magically and efficiently. This is what it does for memory mapping virtual memory anyway. We used to use it to work with massive datasets on the early 32 bit NT systems and the technology just works.
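To make that concrete, here is a hedged sketch of a named shared section that both the 32-bit host and the 64-bit child can open (the names are placeholders; this one is pagefile-backed, but you could equally back it with a real file). The section itself can be far larger than the 32-bit side could ever map, so the 32-bit process only maps the window it currently needs.
#include <windows.h>

HANDLE createSharedSection(const wchar_t *name, unsigned long long bytes)
{
    // Create (or open, if it already exists) a named, pagefile-backed section.
    return CreateFileMappingW(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                              static_cast<DWORD>(bytes >> 32),
                              static_cast<DWORD>(bytes & 0xFFFFFFFFull),
                              name);
}

// Map just a window of the section into this process. The offset must be a
// multiple of the allocation granularity (64 KB on Windows).
void *mapWindow(HANDLE hSection, unsigned long long offset, SIZE_T viewBytes)
{
    return MapViewOfFile(hSection, FILE_MAP_ALL_ACCESS,
                         static_cast<DWORD>(offset >> 32),
                         static_cast<DWORD>(offset & 0xFFFFFFFFull),
                         viewBytes);
}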

Allocate large blocks of contiguous memory - do or don't?

I have always been convinced that it is not a good practice to allocate large blocks of contiguous memory. It is clear that you are likely to run into trouble if memory fragmentation comes into play, which in most cases cannot be excluded for sure (especially in large projects designed as services or the like).
Recently I came across the ITK image processing library and realized that they (virtually) always allocate image data (even 3D - which might be huge) as one contiguous block. I was told that this should not be a problem, at least for 64-bit processes. However, I don't see a systematic difference between 64-bit and 32-bit processes besides the fact that memory problems might show up later due to the larger virtual address space.
To come to the point: I wonder what is good practice when dealing with large amounts of data: Simply allocate it as one big block, or better split it up into smaller pieces for allocation?
As the question is of course rather system specific, I would like to restrict it to native (unmanaged, no CLR) C++, especially under Windows. However, I would also be interested in any more general comments - if possible.
The question almost seems nonsensical... let me rephrase it to illustrate:
If you need a large block of memory and are worried about fragmentation, should you just fragment it yourself?
You don't gain anything by fragmenting it yourself rather than letting the system memory manager fragment it for you. The system is extremely good at this, and you are not likely to do it better.
That being said, if all things being equal you can do the same task but broken into sensible fragments, it may be worth profiling to see if you can gain anything. But in general, you won't gain anything in a reasonable sense -- you won't be able to outperform the OS.

Which memory allocation algorithm suits best for performance and time critical c++ applications?

I ask this question to determine which memory allocation algorithm gives better results for performance critical applications, like game engines or embedded applications. The results actually depend on the percentage of memory fragmentation and the time-determinism of memory requests.
There are several algorithms in the text books (e.g. buddy memory allocation), but there are also others like TLSF. Therefore, of the memory allocation algorithms available, which is the fastest and causes the least fragmentation? By the way, garbage collectors should not be included.
Please also note that this question is not about profiling; it just aims to find the optimum algorithm for the given requirements.
It all depends on the application. Server applications which can clear out all memory relating to a particular request at defined moments will have a different memory access pattern than video games, for instance.
If there was one memory allocation algorithm that was always best for performance and fragmentation, wouldn't the people implementing malloc and new always choose that algorithm?
Nowadays, it's usually best to assume that the people who wrote your operating system and runtime libraries weren't brain dead; and unless you have some unusual memory access pattern don't try to beat them.
Instead, try to reduce the number of allocations (or reallocations) you make. For instance, I often use a std::vector, but if I know ahead of time how many elements it will have, I can reserve that all in one go. This is much more efficient than letting it grow "naturally" through several calls to push_back().
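For example:
#include <vector>

void fill()
{
    std::vector<int> v;
    v.reserve(1000000);              // one allocation up front
    for (int i = 0; i < 1000000; ++i)
        v.push_back(i);              // no reallocations, no copying of existing elements
}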
Many people coming from languages where new just means "gimme an object" will allocate things for no good reason. If you don't have to put it on the heap, don't call new.
As for fragmentation: it still depends. Unfortunately I can't find the link now, but I remember a blog post from somebody at Microsoft who had worked on a C++ server application that suffered from memory fragmentation. The team solved the problem by allocating memory from two regions. Memory for all requests would come from region A until it was full (requests would free memory as normal). When region A was full, all memory would be allocated from region B. By the time region B was full, region A was completely empty again. This solved their fragmentation problem.
Will it solve yours? I have no idea. Are you working on a project which services several independent requests? Are you working on a game?
As for determinism: it still depends. What is your deadline? What happens when you miss the deadline (astronauts lost in space? the music being played back starts to sound like garbage?)? There are real time allocators, but remember: "real time" means "makes a promise about meeting a deadline," not necessarily "fast."
I did just come across a post describing various things Facebook has done to both speed up and reduce fragmentation in jemalloc. You may find that discussion interesting.
Barış:
Your question is very general, but here's my answer/guidance:
I don't know about game engines, but for embedded and real time applications, the general goals of an allocation algorithm are:
1- Bounded execution time: You have to know in advance the worst case allocation time so you can plan your real time tasks accordingly.
2- Fast execution: Well, the faster the better, obviously
3- Always allocate: Especially for real-time, security critical applications, all requests must be satisfied. If you request some memory space and get a null pointer: trouble!
4- Reduce fragmentation: Although this depends on the algorithm used, generally, less fragmented allocations provide better performance, due to a number of reasons, including caching effects.
In most critical systems, you are not allowed to dynamically allocate any memory to begin with. You analyze your requirements and determine your maximum memory use and allocate a large chunk of memory as soon as your application starts. If you can't, then the application does not even start, if it does start, no new memory blocks are allocated during execution.
If speed is a concern, I'd recommend following a similar approach. You can implement a memory pool which manages your memory. The pool could initialize a "sufficient" block of memory in the start of your application and serve your memory requests from this block. If you require more memory, the pool can do another -probably large- allocation (in anticipation of more memory requests), and your application can start using this newly allocated memory. There are various memory pooling schemes around as well, and managing these pools is another whole topic.
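A rough sketch of that kind of growing pool is below. It is only an illustration (the name GrowingPool is made up): there is no alignment handling, no per-object free, and memory is only returned to the OS when the pool is destroyed.
#include <algorithm>
#include <cstddef>
#include <vector>

// Grows by whole chunks; individual allocations are just pointer bumps.
class GrowingPool
{
public:
    explicit GrowingPool(std::size_t chunkBytes) : chunkBytes_(chunkBytes), used_(0)
    {
        chunks_.push_back(std::vector<char>(chunkBytes_));   // the "sufficient" block up front
    }

    void *allocate(std::size_t bytes)
    {
        if (used_ + bytes > chunkBytes_)                     // current chunk exhausted:
        {                                                    // grab another big block in anticipation
            chunks_.push_back(std::vector<char>(std::max(bytes, chunkBytes_)));
            used_ = 0;
        }
        void *p = chunks_.back().data() + used_;
        used_ += bytes;
        return p;
    }

private:
    std::size_t chunkBytes_;
    std::size_t used_;                                       // offset into the current chunk
    std::vector<std::vector<char>> chunks_;
};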
As for some examples: VxWorks RTOS used to employ a first-fit allocation algorithm where the algorithm analyzed a linked list to find a big enough free block. In VxWorks 6, they're using a best-fit algorithm, where the free space is kept in a tree and allocations traverse the tree for a big enough free block. There's a white paper titled Memory Allocation in VxWorks 6.0, by Zoltan Laszlo, which you can find by Googling, that has more detail.
Going back to your question about speed/fragmentation: It really depends on your application. Things to consider are:
Are you going to make lots of very small allocations, or relatively larger ones?
Will the allocations come in bursts, or spread equally throughout the application?
What is the lifetime of the allocations?
If you're asking this question because you're going to implement your own allocator, you should probably design it in such a way that you can change the underlying allocation/deallocation algorithm, because if the speed/fragmentation is really that critical in your application, you're going to want to experiment with different allocators. If I were to recommend something without knowing any of your requirements, I'd start with TLSF, since it has good overall characteristics.
As others have already written, there is no "optimum algorithm" for every possible application. It has already been proven that for any possible algorithm you can find an allocation sequence which will cause fragmentation.
Below I write a few hints from my game development experience:
Avoid allocations if you can
A common practice in the game development field was (and to a certain extent still is) to solve dynamic memory allocation performance issues by avoiding memory allocation like the plague. It is quite often possible to use stack based memory instead - even for dynamic arrays you can often come up with an estimate which will cover 99 % of cases for you, and you need to allocate only when you are over this boundary. Another commonly used approach is "preallocation": estimate how much memory you will need in some function or for some object, create a kind of small and simplistic "local heap" allocated up front, and perform the individual allocations from this heap only.
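One hedged way to picture the "estimate that covers 99 % of cases" trick (the sizes and names here are made up): keep a fixed stack buffer and fall back to the heap only when the estimate is exceeded.
#include <cstddef>
#include <vector>

void gatherHits(std::size_t count)
{
    int stackBuf[256];                    // covers the common case: no allocation at all
    std::vector<int> heapBuf;             // only touched when the estimate is blown
    int *hits = stackBuf;
    if (count > 256)
    {
        heapBuf.resize(count);            // the rare slow path pays for the allocation
        hits = heapBuf.data();
    }
    for (std::size_t i = 0; i < count; ++i)
        hits[i] = static_cast<int>(i);    // stand-in for real work
}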
Memory allocator libraries
Another option is to use one of the memory allocation libraries - they are usually created by experts in the field to fit some special requirements, and if you have similar requirements, one of them may fit your needs.
Multithreading
There is one particular case in which you will find the "default" OS/CRT allocator performs badly, and that is multithreading. If you are targeting Windows, be aware that both the OS and CRT allocators provided by Microsoft (including the otherwise excellent Low Fragmentation Heap) are currently blocking. If you want to perform significant threading, you need either to reduce allocation as much as possible, or to use one of the alternatives. See: Can multithreading speed up memory allocation?
The best practice is: use whatever you can to get the thing done in time (in your case - the default allocator). If the whole thing is very complex, write tests and samples that emulate parts of the whole thing. Then run performance tests and benchmarks to find bottlenecks (they will probably have nothing to do with memory allocation :).
From this point you will see what exactly slows down your code and why. Only based on such precise knowledge can you ever optimize something and choose one algorithm over another. Without tests it's just a waste of time since you can't even measure how much your optimization will speed up your app (in fact, such "premature" optimizations can really slow it down).
Memory allocation is a very complex thing and it really depends on many factors. For example, this allocator is simple and damn fast but can be used only in a limited number of situations:
char pool[MAX_MEMORY_REQUIRED_TO_RENDER_FRAME];
char *poolHead = pool;
/* bump allocation: just advance the head pointer (no alignment, no bounds check) */
void *alloc(size_t sz) { char *p = poolHead; poolHead += sz; return p; }
/* "freeing" resets the whole pool at once, e.g. at the end of a frame */
void free() { poolHead = pool; }
So there is no "the best algorithm ever".
One constraint that's worth mentioning, which has not been mentioned yet, is multi-threading: Standard allocators must be implemented to support several threads, all allocating/deallocating concurrently, and passing objects from one thread to another so that it gets deallocated by a different thread.
As you may have guessed from that description, it is a tricky task to implement an allocator that handles all of this well. And it does cost performance, as it is impossible to satisfy all these constraints without inter-thread communication (= use of atomic variables and locks), which is quite costly.
As such, if you can avoid concurrency in your allocations, you stand a good chance to implement your own allocator that significantly outperforms the standard allocators: I once did this myself, and it saved me roughly 250 CPU cycles per allocation with a fairly simple allocator that's based on a number of fixed sized memory pools for small objects, stacking free objects with an intrusive linked list.
Of course, avoiding concurrency is likely a no-go for you, but if you don't use it anyway, exploiting that fact might be something worth thinking about.
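A hedged sketch of the kind of allocator described above - fixed-size blocks, free blocks chained through an intrusive list, and deliberately not thread-safe. It assumes blockSize is at least sizeof(void*) and a multiple of the required alignment; the class name is made up for illustration.
#include <cstddef>
#include <vector>

// Single-threaded fixed-size-block pool: free blocks store the 'next free'
// pointer inside themselves, so the bookkeeping costs no extra memory.
class FixedBlockPool
{
public:
    FixedBlockPool(std::size_t blockSize, std::size_t blockCount)
        : storage_(blockSize * blockCount), freeList_(NULL)
    {
        for (std::size_t i = 0; i < blockCount; ++i)
        {
            void *block = &storage_[i * blockSize];
            *static_cast<void **>(block) = freeList_;   // push onto the intrusive free list
            freeList_ = block;
        }
    }

    void *allocate()                 // O(1): pop the head of the free list
    {
        if (!freeList_) return NULL;
        void *block = freeList_;
        freeList_ = *static_cast<void **>(block);
        return block;
    }

    void deallocate(void *block)     // O(1): push back onto the free list
    {
        *static_cast<void **>(block) = freeList_;
        freeList_ = block;
    }

private:
    std::vector<char> storage_;
    void *freeList_;
};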

mmap() vs. reading blocks

I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.
Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.
The mmap() code could potentially get very messy since mmap'd blocks need to lie on page sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page sized boundaries.
How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?
I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.
A call to mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors for the same reasons that switching between different processes is expensive.
The IO system can already use the disk cache, so if you read a file, you'll hit the cache or miss it no matter what method you use.
However,
Memory maps are generally faster for random access, especially if your access patterns are sparse and unpredictable.
Memory maps allow you to keep using pages from the cache until you are done. This means that if you use a file heavily for a long period of time, then close it and reopen it, the pages will still be cached. With read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache and this kind of foolery rarely helps system performance).
Reading a file directly is very simple and fast.
The discussion of mmap/read reminds me of two other performance discussions:
Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which made perfect sense if you know that nonblocking I/O requires making more syscalls.
Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.
Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.
(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)
There are lots of good answers here already that cover many of the salient points, so I'll just add a couple of issues I didn't see addressed directly above. That is, this answer shouldn't be considered a comprehensive list of the pros and cons, but rather an addendum to other answers here.
mmap seems like magic
Taking the case where the file is already fully cached1 as the baseline2, mmap might seem pretty much like magic:
mmap only requires 1 system call to (potentially) map the entire file, after which no more system calls are needed.
mmap doesn't require a copy of the file data from kernel to user-space.
mmap allows you to access the file "as memory", including processing it with whatever advanced tricks you can do against memory, such as compiler auto-vectorization, SIMD intrinsics, prefetching, optimized in-memory parsing routines, OpenMP, etc.
In the case that the file is already in the cache, it seems impossible to beat: you just directly access the kernel page cache as memory and it can't get faster than that.
Well, it can.
mmap is not actually magic because...
mmap still does per-page work
A primary hidden cost of mmap vs read(2) (which is really the comparable OS-level syscall for reading blocks) is that with mmap you'll need to do "some work" for every 4K page accessed in a new mapping, even though it might be hidden by the page-fault mechanism.
For example, a typical implementation that just mmaps the entire file will need to fault in 100 GB / 4K = 25 million pages to read a 100 GB file. Now, these will be minor faults, but 25 million page faults is still not going to be super fast. The cost of a minor fault is probably in the 100s of nanos in the best case.
mmap relies heavily on TLB performance
Now, you can pass MAP_POPULATE to mmap to tell it to set up all the page tables before returning, so there should be no page faults while accessing it. Now, this has the little problem that it also reads the entire file into RAM, which is going to blow up if you try to map a 100GB file - but let's ignore that for now3. The kernel needs to do per-page work to set up these page tables (shows up as kernel time). This ends up being a major cost in the mmap approach, and it's proportional to the file size (i.e., it doesn't get relatively less important as the file size grows)4.
Finally, even in user-space accessing such a mapping isn't exactly free (compared to large memory buffers not originating from a file-based mmap) - even once the page tables are set up, each access to a new page is going to, conceptually, incur a TLB miss. Since mmaping a file means using the page cache and its 4K pages, you again incur this cost 25 million times for a 100GB file.
Now, the actual cost of these TLB misses depends heavily on at least the following aspects of your hardware: (a) how many 4K TLB entries you have and how the rest of the translation caching performs; (b) how well hardware prefetch deals with the TLB - e.g., can a prefetch trigger a page walk? (c) how fast and how parallel the page walking hardware is. On modern high-end x86 Intel processors, the page walking hardware is in general very strong: there are at least 2 parallel page walkers, a page walk can occur concurrently with continued execution, and hardware prefetching can trigger a page walk. So the TLB impact on a streaming read load is fairly low - and such a load will often perform similarly regardless of the page size. Other hardware is usually much worse, however!
read() avoids these pitfalls
The read() syscall, which is what generally underlies the "block read" type calls offered e.g., in C, C++ and other languages has one primary disadvantage that everyone is well-aware of:
Every read() call of N bytes must copy N bytes from kernel to user space.
On the other hand, it avoids most of the costs above - you don't need to map 25 million 4K pages into user space. You can usually malloc a single small buffer in user space, and re-use it repeatedly for all your read calls. On the kernel side, there is almost no issue with 4K pages or TLB misses because all of RAM is usually linearly mapped using a few very large pages (e.g., 1 GB pages on x86), so the underlying pages in the page cache are covered very efficiently in kernel space.
So basically you have the following comparison to determine which is faster for a single read of a large file:
Is the extra per-page work implied by the mmap approach more costly than the per-byte work of copying file contents from kernel to user space implied by using read()?
On many systems, they are actually approximately balanced. Note that each one scales with completely different attributes of the hardware and OS stack.
In particular, the mmap approach becomes relatively faster when:
The OS has fast minor-fault handling and especially minor-fault bulking optimizations such as fault-around.
The OS has a good MAP_POPULATE implementation which can efficiently process large maps in cases where, for example, the underlying pages are contiguous in physical memory.
The hardware has strong page translation performance, such as large TLBs, fast second level TLBs, fast and parallel page-walkers, good prefetch interaction with translation and so on.
... while the read() approach becomes relatively faster when:
The read() syscall has good copy performance. E.g., good copy_to_user performance on the kernel side.
The kernel has an efficient (relative to userland) way to map memory, e.g., using only a few large pages with hardware support.
The kernel has fast syscalls and a way to keep kernel TLB entries around across syscalls.
The hardware factors above vary wildly across different platforms, even within the same family (e.g., within x86 generations and especially market segments) and definitely across architectures (e.g., ARM vs x86 vs PPC).
The OS factors keep changing as well, with various improvements on both sides causing a large jump in the relative speed for one approach or the other. A recent list includes:
Addition of fault-around, described above, which really helps the mmap case without MAP_POPULATE.
Addition of fast-path copy_to_user methods in arch/x86/lib/copy_user_64.S, e.g., using REP MOVQ when it is fast, which really helps the read() case.
Update after Spectre and Meltdown
The mitigations for the Spectre and Meltdown vulnerabilities considerably increased the cost of a system call. On the systems I've measured, the cost of a "do nothing" system call (which is an estimate of the pure overhead of the system call, apart from any actual work done by the call) went from about 100 ns on a typical modern Linux system to about 700 ns. Furthermore, depending on your system, the page-table isolation fix specifically for Meltdown can have additional downstream effects apart from the direct system call cost due to the need to reload TLB entries.
All of this is a relative disadvantage for read() based methods as compared to mmap based methods, since read() methods must make one system call for each "buffer size" worth of data. You can't arbitrarily increase the buffer size to amortize this cost since using large buffers usually performs worse since you exceed the L1 size and hence are constantly suffering cache misses.
On the other hand, with mmap, you can map in a large region of memory with MAP_POPULATE and the access it efficiently, at the cost of only a single system call.
1 This more-or-less also includes the case where the file wasn't fully cached to start with, but where the OS read-ahead is good enough to make it appear so (i.e., the page is usually cached by the time you want it). This is a subtle issue though because the way read-ahead works is often quite different between mmap and read calls, and can be further adjusted by "advise" calls as described in 2.
2 ... because if the file is not cached, your behavior is going to be completely dominated by IO concerns, including how sympathetic your access pattern is to the underlying hardware - and all your effort should be in ensuring such access is as sympathetic as possible, e.g. via use of madvise or fadvise calls (and whatever application level changes you can make to improve access patterns).
3 You could get around that, for example, by sequentially mmaping in windows of a smaller size, say 100 MB.
4 In fact, it turns out the MAP_POPULATE approach is (at least on some hardware/OS combinations) only slightly faster than not using it, probably because the kernel is using fault-around - so the actual number of minor faults is reduced by a factor of 16 or so.
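To make the MAP_POPULATE / windowed-mapping discussion above concrete, here is a hedged, Linux-only sketch (error handling trimmed; the window size is whatever footnote 3's advice suggests for your workload):
#include <sys/mman.h>
#include <sys/types.h>

// Map one window of the file and ask the kernel to pre-fault it, so the
// subsequent scan takes no minor faults (at the cost of the up-front populate).
void *mapPopulatedWindow(int fd, off_t offset, size_t windowBytes)
{
    return mmap(NULL, windowBytes, PROT_READ,
                MAP_PRIVATE | MAP_POPULATE,   // MAP_POPULATE is Linux-specific
                fd, offset);                  // offset must be page-aligned
}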
The main performance cost is going to be disk i/o. "mmap()" is certainly quicker than istream, but the difference might not be noticeable because the disk i/o will dominate your run-times.
I tried Ben Collins's code fragment (see above/below) to test his assertion that "mmap() is way faster" and found no measurable difference. See my comments on his answer.
I would certainly not recommend separately mmap'ing each record in turn unless your "records" are huge - that would be horribly slow, requiring 2 system calls for each record and possibly losing the page out of the disk-memory cache.....
In your case I think mmap(), istream and the low-level open()/read() calls will all be about the same. I would recommend mmap() in these cases:
There is random access (not sequential) within the file, AND
the whole thing fits comfortably in memory OR there is locality-of-reference within the file so that certain pages can be mapped in and other pages mapped out. That way the operating system uses the available RAM to maximum benefit.
OR if multiple processes are reading/working on the same file, then mmap() is fantastic because the processes all share the same physical pages.
(btw - I love mmap()/MapViewOfFile()).
mmap is way faster. You might write a simple benchmark to prove it to yourself:
char data[0x1000];
std::ifstream in("file.bin");
while (in)
{
in.read(data, 0x1000);
// do something with data
}
versus:
const int file_size=something;
const int page_size=0x1000;
int off=0;
void *data;
int fd = open("filename.bin", O_RDONLY);
while (off < file_size)
{
data = mmap(NULL, page_size, PROT_READ, MAP_PRIVATE, fd, off); /* the flags argument must include MAP_PRIVATE or MAP_SHARED */
// do stuff with data
munmap(data, page_size);
off += page_size;
}
Clearly, I'm leaving out details (like how to determine when you reach the end of the file in the event that your file isn't a multiple of page_size, for instance), but it really shouldn't be much more complicated than this.
If you can, you might try to break up your data into multiple files that can be mmap()-ed in whole instead of in part (much simpler).
A couple of months ago I had a half-baked implementation of a sliding-window mmap()-ed stream class for boost_iostreams, but nobody cared and I got busy with other stuff. Most unfortunately, I deleted an archive of old unfinished projects a few weeks ago, and that was one of the victims :-(
Update: I should also add the caveat that this benchmark would look quite different in Windows because Microsoft implemented a nifty file cache that does most of what you would do with mmap in the first place. I.e., for frequently-accessed files, you could just do std::ifstream.read() and it would be as fast as mmap, because the file cache would have already done a memory-mapping for you, and it's transparent.
Final Update: Look, people: across a lot of different platform combinations of OS and standard libraries and disks and memory hierarchies, I can't say for certain that the system call mmap, viewed as a black box, will always always always be substantially faster than read. That wasn't exactly my intent, even if my words could be construed that way. Ultimately, my point was that memory-mapped i/o is generally faster than byte-based i/o; this is still true. If you find experimentally that there's no difference between the two, then the only explanation that seems reasonable to me is that your platform implements memory-mapping under the covers in a way that is advantageous to the performance of calls to read. The only way to be absolutely certain that you're using memory-mapped i/o in a portable way is to use mmap. If you don't care about portability and you can rely on the particular characteristics of your target platforms, then using read may be suitable without sacrificing measurably any performance.
Edit to clean up answer list:
#jbl:
the sliding window mmap sounds
interesting. Can you say a little more
about it?
Sure - I was writing a C++ library for Git (a libgit++, if you will), and I ran into a similar problem to this: I needed to be able to open large (very large) files and not have performance be a total dog (as it would be with std::fstream).
Boost::Iostreams already has a mapped_file Source, but the problem was that it was mmapping whole files, which limits you to 2^(wordsize). On 32-bit machines, 4GB isn't big enough. It's not unreasonable to expect to have .pack files in Git that become much larger than that, so I needed to read the file in chunks without resorting to regular file i/o. Under the covers of Boost::Iostreams, I implemented a Source, which is more or less another view of the interaction between std::streambuf and std::istream. You could also try a similar approach by just inheriting std::filebuf into a mapped_filebuf and similarly, inheriting std::fstream into a mapped_fstream. It's the interaction between the two that's difficult to get right. Boost::Iostreams has some of the work done for you, and it also provides hooks for filters and chains, so I thought it would be more useful to implement it that way.
I'm sorry Ben Collins lost his sliding windows mmap source code. That'd be nice to have in Boost.
Yes, mapping the file is much faster. You're essentially using the OS virtual memory subsystem to associate memory to disk and vice versa. Think about it this way: if the OS kernel developers could make it faster, they would. Because doing so makes just about everything faster: databases, boot times, program load times, et cetera.
The sliding window approach really isn't that difficult, as multiple contiguous pages can be mapped at once. So the size of the record doesn't matter so long as the largest single record will fit into memory. The important thing is managing the book-keeping.
If a record doesn't begin on a getpagesize() boundary, your mapping has to begin on the previous page boundary. The length of the region mapped extends from the first byte of the record (rounded down if necessary to the nearest multiple of getpagesize()) to the last byte of the record (rounded up to the nearest multiple of getpagesize()). When you're finished processing a record, you can munmap() it and move on to the next.
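A hedged sketch of that bookkeeping (POSIX flavour; mapRecord is a made-up name, and the record offsets come from wherever your index lives):
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Map the smallest page-aligned region that covers a record at [offset, offset+length).
// Returns a pointer to the record itself; keep *mapBase and *mapLen around for munmap().
char *mapRecord(int fd, off_t offset, size_t length, void **mapBase, size_t *mapLen)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t alignedStart = (offset / page) * page;              // round down to a page boundary
    *mapLen = (size_t)(offset - alignedStart) + length;       // slack before the record + the record
    *mapBase = mmap(NULL, *mapLen, PROT_READ, MAP_PRIVATE, fd, alignedStart);
    if (*mapBase == MAP_FAILED)
        return NULL;
    return (char *)*mapBase + (offset - alignedStart);        // skip the slack to reach the record
}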
This all works just fine under Windows too using CreateFileMapping() and MapViewOfFile() (and GetSystemInfo() to get SYSTEM_INFO.dwAllocationGranularity --- not SYSTEM_INFO.dwPageSize).
mmap should be faster, but I don't know how much. It very much depends on your code. If you use mmap it's best to mmap the whole file at once; that will make your life a lot easier. One potential problem is that if your file is bigger than 4 GB (or in practice the limit is lower, often 2 GB) you will need a 64-bit architecture. So if you're using a 32-bit environment, you probably don't want to use it.
Having said that, there may be a better route to improving performance. You said the input file gets scanned many times; if you can read it out in one pass and then be done with it, that could potentially be much faster.
Perhaps you should pre-process the files, so each record is in a separate file (or at least that each file is a mmap-able size).
Also could you do all of the processing steps for each record, before moving onto the next one? Maybe that would avoid some of the IO overhead?
I agree that mmap'd file I/O is going to be faster, but while you're benchmarking the code, shouldn't the counter-example be somewhat optimized?
Ben Collins wrote:
char data[0x1000];
std::ifstream in("file.bin");
while (in)
{
in.read(data, 0x1000);
// do something with data
}
I would suggest also trying:
char data[0x1000];
std::ifstream ifile( "file.bin");
std::istream in( ifile.rdbuf() );
while( in )
{
in.read( data, 0x1000);
// do something with data
}
And beyond that, you might also try making the buffer size the same size as one page of virtual memory, in case 0x1000 is not the size of one page of virtual memory on your machine... IMHO mmap'd file I/O still wins, but this should make things closer.
I remember mapping a huge file containing a tree structure into memory years ago. I was amazed by the speed compared to normal de-serialization, which involves a lot of work in memory, like allocating tree nodes and setting pointers.
So in fact I was comparing a single call to mmap (or its counterpart on Windows) against many (MANY) calls to operator new and constructor calls.
For this kind of task, mmap is unbeatable compared to de-serialization.
Of course one should look into Boost's relocatable pointers for this.
This sounds like a good use-case for multi-threading... I'd think you could pretty easily setup one thread to be reading data while the other(s) process it. That may be a way to dramatically increase the perceived performance. Just a thought.
To my mind, using mmap() "just" unburdens the developer from having to write their own caching code. In a simple "read through the file exactly once" case, this isn't going to be hard (although as mlbrock points out you still save the memory copy into process space), but if you're going back and forth in the file or skipping bits and so forth, I believe the kernel developers have probably done a better job implementing caching than I can...
I think the greatest thing about mmap is potential for asynchronous reading with:
addr1 = NULL;
while( size_left > 0 ) {
r = min(MMAP_SIZE, size_left);
addr2 = mmap(NULL, r,
PROT_READ, MAP_FLAGS,
fd, pos);  /* fd is the file descriptor opened earlier */
if (addr1 != NULL)
{
/* process mmap from prev cycle */
feed_data(ctx, addr1, MMAP_SIZE);
munmap(addr1, MMAP_SIZE);
}
addr1 = addr2;
size_left -= r;
pos += r;
}
feed_data(ctx, addr1, r);
munmap(addr1, r);
Problem is that I can't find the right MAP_FLAGS to give a hint that this memory should be synced from file asap.
I hope that MAP_POPULATE gives the right hint for mmap (i.e. it will not try to load all contents before returning from the call, but will do that asynchronously, overlapping with feed_data). At least it gives better results with this flag, even though the manual states that it does nothing without MAP_PRIVATE since 2.6.23.