mmap versus memory allocated with new - C++

I have a BitVector class that can either allocate memory dynamically using new or it can mmap a file. There isn't a noticeable difference in performance when using it with small files, but when using a 16GB file I have found that the mmap file is far slower than the memory allocated with new. (Something like 10x slower or more.) Note that my machine has 64GB of RAM.
The code in question is loading values from a large disk file and placing them into a Bloom filter which uses my BitVector class for storage.
At first I thought this might be because the backing for the mmap file was on the same disk as the file I was loading from, but this didn't seem to be the issue. I put the two files on two physically different disks, and there was no change in performance. (Although I believe they are on the same controller.)
Then, I used mlock to try to force everything into RAM, but the mmap implementation was still really slow.
So, for the time being I'm just allocating the memory directly. The only thing I'm changing in the code for this comparison is the flag passed to the BitVector constructor.
Note that to measure performance I'm both looking at top and watching how many states I can add into the Bloom filter per second. The CPU usage doesn't even register on top when using mmap - although jbd2/sda1-8 starts to move up (I'm running on an Ubuntu server), which looks to be a process that is dealing with journaling for the drive. The input and output files are stored on two HDDs.
Can anyone explain this huge difference in performance?
Thanks!

Just to start with, mmap is a system call, an interface for mapping files (or anonymous memory) into a process's virtual address space.
Now, in Linux (I assume you are working on *nix) a lot of performance improvement is achieved by lazy loading, better known as demand paging (a close relative of copy-on-write).
This kind of lazy loading is implemented for mmap as well.
What happens is that when you call mmap on a file, the kernel does not immediately allocate physical page frames for the mapping. Instead, it waits for the program to read from or write to one of the not-yet-backed pages; at that point a page fault occurs, and the fault handler actually loads the part of the file that fits in that page frame (the page table is also updated, so the next time you read or write the same page it points to a valid frame).
You can control this behavior with mlock, madvise, the MAP_POPULATE flag to mmap, and so on.
Passing MAP_POPULATE to mmap tells the kernel to populate the mapping before the call returns, rather than taking a page fault every time you first touch a page. So the call blocks until the file has been read in.
From the man page:
MAP_POPULATE (since Linux 2.5.46)
    Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
    MAP_POPULATE is supported for private mappings only since Linux 2.6.23.
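As a minimal sketch of how that flag is used (POSIX, C++; the file name is a placeholder and most error handling is omitted), the only difference from a lazy mapping is the extra flag in the mmap call:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const char* path = "bitvector.bin";   // placeholder file name
        int fd = open(path, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
        size_t len = static_cast<size_t>(st.st_size);

        // Without MAP_POPULATE, pages are faulted in on first access.
        // With it, the kernel prefaults the whole mapping (and triggers
        // read-ahead on the file) before mmap returns, so the call blocks
        // until the data has been read in.
        void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_POPULATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        // ... use the mapping as ordinary memory ...

        munmap(p, len);
        close(fd);
        return 0;
    }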

Related

What could be the cause for this performance difference?

I'm constructing a cache file (~70 MB for the test) meant for spinning drives. There is a lot of random I/O involved since I'm sorting things into it; that is somewhat alleviated by caching sequential items, but I also have memory constraints.
Anyway, the difference shows up between when I
a) freshly create the file and write it full of data: ~100 s
b) open the same existing file and write it full of data: ~30 s
I'm using memory-mapped files to access them, and when I freshly create a file I preallocate it, of course. I verified all the data; it's accurate.
The data I'm writing is slightly different each time (something like a 5% difference, evenly distributed all over). Could it be that when I write to an MMF and overwrite something with the same data, it doesn't consider the page dirty and thus doesn't actually write anything at all? How could it know?
Or perhaps there is some kind of write caching going on in Windows or the hardware?
Try to trace the page faults, or at least monitor the page faults with Process Explorer during each write phase.
Anyway, when you open the same file with write access, the file is "recreated", but in memory the existing mapped pages are kept as they are. Then, during the writing, if the data is byte-for-byte identical within a whole page (usually 4 KB per page, so statistically this can happen with your data), the content of the page will not be flagged as "updated". So when the file is closed, no flush occurs for those pages; that's why you see a big difference in performance.
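Whether or not the OS skips identical writes for you, you can guarantee that unchanged pages are never dirtied by comparing before writing. A minimal sketch (the page size and helper name are illustrative assumptions, and dst is assumed to point into the mapped view):

    #include <cstring>
    #include <cstddef>

    // Copy only the page-sized regions that actually differ, so pages whose
    // contents are unchanged are never written to and therefore never dirtied.
    void copy_if_different(char* dst, const char* src, std::size_t n) {
        const std::size_t kPage = 4096;  // assumed page size
        for (std::size_t off = 0; off < n; off += kPage) {
            std::size_t len = (n - off < kPage) ? (n - off) : kPage;
            if (std::memcmp(dst + off, src + off, len) != 0)
                std::memcpy(dst + off, src + off, len);
        }
    }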

What is up with memory mapped files and actual memory usage?

Can't really find any specifics on this, so here's all I know about MMFs in Windows:
Creating a memory-mapped file in Windows adds nothing to the apparent amount of memory a program uses.
Creating a view of that file consumes memory equivalent to the view size.
This looks rather backwards to me since, for one, I know that the MMF itself actually has memory... somewhere. If I write something into an MMF and destroy the view, the data is still there. Meanwhile, why does the view take any memory at all? It's just a pointer, no?
Then there's the weirdness of what's actually in RAM and what's on disk. In large MMFs with a scattered access pattern, sometimes the speed is there and sometimes it isn't. I'm guessing some of it sometimes gets stored in the backing file (if one is tied to it) or in the paging file, but really, I have no clue.
Anyway, the problem that drove me to investigate this is that I have a ~2 GB file that I want multiple programs to share. I can't create a 2 GB view in each of them since I'm just "out of memory", so I have to create/destroy smaller ones. This creates a lot of overhead due to the additional offset calculations and the creation of the view itself. Can anybody explain to me why it is like this?
On a demand-paged virtual memory operating system like Windows, the view of an MMF occupies only address space: just numbers to the processor, one for each 4096 bytes. You don't start using RAM until you actually use the view by reading or writing data, at which point you trigger a page fault and force the OS to map the virtual memory page to physical memory. That's the "demand-paged" part.
You can't get a single chunk of 2 GB of address space in a 32-bit process since there would not be room for anything else. The limit is the largest hole in the address space between other allocations for code and data, which usually hovers around ~650 megabytes, give or take. You'll need to target x64, or build an x86 program that's linked with /LARGEADDRESSAWARE and runs on a 64-bit operating system, a backdoor which is getting to be pretty pointless these days.
The point of a memory-mapped file is that it lets you manipulate its data without explicit I/O calls. Because of this, when you access the file, Windows loads the relevant parts into physical memory so they can be manipulated there rather than on disk. You can read more about this here: http://blogs.msdn.com/b/khen1234/archive/2006/01/30/519483.aspx
Anyway, the problem that drove me to investigate this is that I have a ~2 GB file that I want multiple programs to share. I can't create a 2 GB view in each of them since I'm just "out of memory", so I have to create/destroy smaller ones.
The most likely cause is that the programs are 32-bit. 32-bit programs (by default) only have 2 GB of address space, so you can't map a 2 GB file in a single view. If you rebuild them in 64-bit mode, the problem should go away.
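If rebuilding in 64-bit mode isn't an option and smaller views are needed, keep in mind that the view offset passed to MapViewOfFile must be a multiple of the system allocation granularity (typically 64 KB). A minimal sketch, assuming the file mapping handle already exists (the helper name MapWindow is invented for illustration):

    #include <windows.h>
    #include <cstddef>
    #include <cstdint>

    // Map a read-only window of 'length' bytes starting at byte 'offset' of
    // the file. The starting offset is aligned down to the allocation
    // granularity as MapViewOfFile requires, and the returned pointer is
    // adjusted back up to the requested offset.
    void* MapWindow(HANDLE hMapping, std::uint64_t offset, std::size_t length,
                    void** baseOut /* pass this to UnmapViewOfFile later */) {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        std::uint64_t aligned = offset - (offset % si.dwAllocationGranularity);
        std::uint64_t delta   = offset - aligned;

        void* base = MapViewOfFile(hMapping, FILE_MAP_READ,
                                   static_cast<DWORD>(aligned >> 32),
                                   static_cast<DWORD>(aligned & 0xFFFFFFFFu),
                                   static_cast<SIZE_T>(length + delta));
        if (!base) return nullptr;
        *baseOut = base;
        return static_cast<char*>(base) + delta;
    }

Each window is released with UnmapViewOfFile on the base address when you are done with it, so a 32-bit process only ever holds the window's worth of address space rather than the whole 2 GB.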

Do memory mapped files provide an advantage for large buffers?

My program works with large data sets that need to be stored in contiguous memory (several gigabytes). Allocating memory using std::allocator (i.e. malloc or new) causes system stalls as large portions of virtual memory are reserved and physical memory gets filled up.
Since the program will mostly work on only small portions at a time, my question is whether using memory-mapped files would provide an advantage (i.e. mmap or the Windows equivalent). That is, creating a large sparse temporary file and mapping it into virtual memory. Or is there another technique that would change the system's paging strategy such that fewer pages are loaded into physical memory at a time?
I'm trying to avoid building a streaming mechanism that loads portions of a file at a time, and instead want to rely on the system's VM paging.
Yes, mmap has the potential to speed things up.
Things to consider:
Remember the VMM will page things in and out in page-sized blocks (4 KB on Linux).
If your memory access is well localised over time, this will work well. But if you do random access over your entire file, you will end up with a lot of seeking and thrashing (still). So, consider whether your 'small portions' correspond with localised bits of the file.
For large allocations, malloc and free will use mmap with MAP_ANON anyway. So the difference in memory mapping a file is simply that you are getting the VMM to do the I/O for you.
Consider using madvise with mmap to assist the VMM in paging well.
When you use open and read (plus, as erenon suggests, posix_fadvise), your file is still held in buffers anyway (i.e. it's not immediately written out) unless you also use O_DIRECT. So in both situations, you are relying on the kernel for I/O scheduling.
If the data is already in a file, it would speed things up, especially in the non-sequential case. (In the sequential case, read wins.)
If using open and read, consider using posix_fadvise as well.
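A minimal sketch of the "large sparse temporary file" approach on POSIX (assuming a 64-bit process; the path and size are placeholders and error handling is abbreviated):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const char*  path = "/tmp/scratch.bin";            // placeholder path
        const size_t size = 8ULL * 1024 * 1024 * 1024;     // 8 GiB of address space

        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        // Extend the file to 'size' bytes without writing anything: on most
        // filesystems this creates a sparse file, so disk blocks are only
        // allocated for pages that actually get written.
        if (ftruncate(fd, static_cast<off_t>(size)) != 0) { perror("ftruncate"); return 1; }

        void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        // Tell the VMM the expected access pattern so it can size read-ahead
        // accordingly (MADV_RANDOM here, for scattered access).
        madvise(p, size, MADV_RANDOM);

        // ... treat p as one huge buffer; the kernel pages individual
        //     4 KB pieces in and out of physical memory as needed ...

        munmap(p, size);
        close(fd);
        unlink(path);
        return 0;
    }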
This really depends on your mmap() implementation. Mapping a file into memory has several advantages that can be exploited by the kernel:
The kernel knows that the contents of the mmap() pages are already present on disk. If it decides to evict these pages, it can omit the write-back.
You reduce copying operations: read() operations typically first read the data into kernel memory, then copy it over to user space.
The reduced copies also mean that less memory is used to store data from the file, which means more memory is available for other uses, which can reduce paging as well.
This is also, why it is generally a bad idea to use large caches within an I/O library: Modern kernels already cache everything they ever read from disk, caching a copy in user space means that the amount of data that can be cached is actually reduced.
Of course, you also avoid a lot of headaches that result from buffering data of unknown size in your application. But that is just a convenience for you as a programmer.
However, even though the kernel can exploit these properties, it does not necessarily do so. My experience is that Linux mmap() is generally fine; on AIX, however, I have witnessed really bad mmap() performance. So, if your goal is performance, it's the old measure-compare-decide standby.

How to cache 1000s of large C++ objects

Environment:
Windows 8 64-bit, Windows Server 2008 64-bit
Visual Studio 2012 (Professional), 64-bit
list<CMyObject> L; // I have 1000s of large CMyObject instances in my program that I cache, shared by different threads in my Windows service program.
For our SaaS middleware product, we cache in memory 1000s of large C++ objects (read only const objects, each about 4MB in size), which runs the system out of memory. Can we associate a disk file (or some other persistent mechanism that is OS managed) to our C++ objects? There is no need for sharing / inter-process communication.
The disk file will suffice if it works for the duration of the process (our windows service program). The read-only const C++ objects are shared by different threads in the same windows service.
I was even considering using object databases (like mongoDB) to store the objects, which will then be loaded / unloaded at each use. Though faster than reading our serialized file (hopefully), it will still spoil the performance.
The purpose is to retain caching of C++ objects for performance reason and avoid having to load / unload the serialized C++ object every time. It would be great if this disk file is OS managed and requires minimal tweaking in our code.
Thanks in advance for your responses.
The only thing which is OS-managed in the manner you describe is the swap file. You can create a separate application (call it a "cache helper") which loads all the objects into memory and waits for requests. Since it does not use its memory pages, the OS will eventually push those pages out to the swap file, bringing them back only if/when needed.
Communication with the application can be done through named pipes or sockets.
Disadvantages of such an approach are that the performance of such a cache will be highly volatile, and it may degrade performance of the whole server.
I'd recommend writing your own caching algorithm/application, as you may later need to adjust its properties.
One solution is of course to simply load every object and let the OS deal with swapping it in from / out to disk as required (or dynamically load, but never discard unless the object is actually being destroyed). This approach will work well if there are a number of objects that are used more frequently than others, and the loading from swap space is almost certainly faster than anything you can write. The exception to this is if you do know beforehand which objects are more or less likely to be used next, and can "throw out" the right objects in case of low memory.
You can certainly also use a memory-mapped file - this will allow you to read from and write to the file as if it were memory (and the OS will cache the content in RAM as memory is available). On Windows, you will be using CreateFileMapping or OpenFileMapping to create/open the file mapping, and then MapViewOfFile to map the file into memory. When finished, use UnmapViewOfFile to "unmap" the memory, and then CloseHandle to close the file mapping.
The only worry about a file mapping is that it may not appear at the same address in memory the next time around, so you can't store pointers inside the file mapping and reload the same data as binary next time. It would of course work fine to create a new file mapping each time.
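Because the view's base address can change between runs, data kept inside the mapping should refer to other records by offset from the base rather than by raw pointer. A minimal sketch of the idea (the record layout is invented for illustration, and a heap buffer stands in for the mapped view so the example is self-contained):

    #include <cstdint>
    #include <cstdio>
    #include <new>
    #include <vector>

    // Records inside the mapped region hold offsets, never raw pointers,
    // so the data stays valid no matter where the view is mapped.
    struct Record {
        std::uint64_t next_offset;  // offset of the next record, 0 = none
        double        payload;
    };

    // Resolve an offset against the current base address of the view.
    inline Record* at(char* base, std::uint64_t offset) {
        return offset ? reinterpret_cast<Record*>(base + offset) : nullptr;
    }

    int main() {
        std::vector<char> view(4096);           // stand-in for a mapped view
        char* base = view.data();

        Record* a = new (base + 64)  Record{128, 1.0};   // record at offset 64
        new (base + 128) Record{0, 2.0};                 // record at offset 128

        for (Record* r = a; r != nullptr; r = at(base, r->next_offset))
            std::printf("payload = %f\n", r->payload);
        return 0;
    }

When the base really comes from MapViewOfFile, only the offsets are persisted in the file, so the records can be reloaded at whatever address the next mapping happens to land.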
So your thousands of massive objects have constructors, destructors, virtual functions and pointers. This means you can't easily page them out yourself. The OS can do it for you, though, so your most practical approach is simply to add more physical memory, possibly an SSD swap volume, and use that 64-bit address space. (I don't know how much is actually addressable on your OS, but presumably enough to fit your ~4 GB of objects.)
Your second option is to find a way to just save some memory. This might be using a specialized allocator to reduce slack, or removing layers of indirection. You haven't given enough information about your data for me to make concrete suggestions on this.
A third option, assuming you can fit your program in memory, is simply to speed up your deserialization. Can you change the format to something you can parse more efficiently? Can you somehow deserialize objects quickly on-demand?
The final option, and the most work, is to manually manage a swapfile. It would be sensible as a first step to split your massive polymorphic classes into two: a polymorphic flyweight (with one instance per concrete subtype), and a flattened aggregate context structure. This aggregate is the part you can swap in and out of your address space safely.
Now you just need a memory-mapped paging mechanism, some kind of cache tracking which pages are currently mapped, possibly a smart pointer replacing your raw pointer with a page+offset which can map data in on-demand, etc. Again, you haven't given enough information on your data structure and access patterns to make more detailed suggestions.

Performance of Win32 memory mapped files vs. CRT fopen/fread

I need to read (scan) a file sequentially and process its content.
File size can be anything from very small (some KB) to very large (some GB).
I tried two techniques using VC10/VS2010 on Windows 7 64-bit:
Win32 memory mapped files (i.e. CreateFile, CreateFileMapping, MapViewOfFile, etc.)
fopen and fread from CRT.
I thought that memory mapped file technique could be faster than CRT functions, but some tests showed that the speed is almost the same in both cases.
The following C++ statements are used for MMF:
HANDLE hFile = CreateFile(
    filename,
    GENERIC_READ,
    FILE_SHARE_READ,
    NULL,
    OPEN_EXISTING,
    FILE_FLAG_SEQUENTIAL_SCAN,
    NULL
);

HANDLE hFileMapping = CreateFileMapping(
    hFile,
    NULL,
    PAGE_READONLY,
    0,
    0,
    NULL
);
The file is read sequentially, chunk by chunk; each chunk is SYSTEM_INFO.dwAllocationGranularity in size.
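For reference, a hedged sketch of what such a chunked scan loop might look like (hFile and hFileMapping as above; ProcessChunk is a placeholder for the actual processing):

    SYSTEM_INFO si;
    GetSystemInfo(&si);
    const DWORD chunk = si.dwAllocationGranularity;

    LARGE_INTEGER fileSize;
    GetFileSizeEx(hFile, &fileSize);

    for (ULONGLONG offset = 0; offset < (ULONGLONG)fileSize.QuadPart; offset += chunk) {
        SIZE_T bytes = (SIZE_T)min((ULONGLONG)chunk,
                                   (ULONGLONG)fileSize.QuadPart - offset);
        const char* p = (const char*)MapViewOfFile(
            hFileMapping, FILE_MAP_READ,
            (DWORD)(offset >> 32), (DWORD)(offset & 0xFFFFFFFF), bytes);
        if (!p) break;              // real code would check GetLastError()
        ProcessChunk(p, bytes);     // placeholder for the actual scan/processing
        UnmapViewOfFile(p);
    }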
Considering that the speed is almost the same with MMF and CRT, I'd use the CRT functions because they are simpler and multi-platform. But I'm curious: am I using the MMF technique correctly? Is it normal that MMF performance, in this case of scanning the file sequentially, is the same as the CRT one?
Thanks.
I believe you'll not see much difference if you access the file sequentially, because file I/O is very heavily cached, and read-ahead is probably also used.
Things would be different if you had many "jumps" during file data processing. Then setting a new file pointer and reading a new file portion each time would probably kill the CRT approach, whereas MMF would give you the maximum possible performance.
Since you are scanning the file sequentially I would not expect disk usage pattern to be much different for either method.
For large files, MMF might reduce data locality and even result in a copy of all or part of the file being placed in the pagefile, whereas processing via CRT using a small buffer would all take place in RAM. In this instance, MMF would probably be slower. You can mitigate this by only mapping in part of the underlying file at a time, but then things get more complex without any likely win over direct sequential I/O.
MMFs are really the way Windows implements inter-process shared memory, rather than a way to speed up generalized file I/O. The file manager cache in the kernel is what you really need to leverage here.
I thought that memory mapped file technique could be faster than CRT functions, but some tests showed that the speed is almost the same in both cases.
You are probably hitting the file system cache for your tests. Unless you explicitly create file handles to bypass the file system cache (FILE_FLAG_NO_BUFFERING when calling CreateFile), the file system cache will kick in and keep recently accessed files in memory.
There is a small speed difference between reading a file that is in the file system cache with buffering turned on and mapping it directly, as the operating system has to perform an extra copy and there is system call overhead. But for your purposes, you should probably stick with the CRT file functions.
Gustavo Duarte has a great article on memory mapped files (from a generic OS perspective).
Both methods will eventually come down to disk I/O; that will be your bottleneck. I would go with the method that my higher-level functionality likes more: if I need streaming, I'll go with CRT files; if I need sequential access and fixed-size files, I would consider memory-mapped files.
Or, in the case where you have an algorithm that works only on memory, memory-mapped files can be the easier way out.
Using ReadFile:
Enters Kernel Mode
Does a memcpy from the Disk Cache
If data isn't in the Disk Cache, triggers a Page Fault which makes the Cache Manager read data from the disk.
Exits Kernel Mode
Cost of entering and leaving Kernel Mode was about 1600 CPU cycles when I measured it.
Avoid small reads, since every call to ReadFile has the overhead of entering and leaving Kernel Mode.
Memory Mapped Files:
Basically places the Disk Cache right into your application's address space.
If data is in the cache, you just read it.
If data isn't there, triggers a Page Fault that makes the Cache Manager read data from the disk. (There is a User/Kernel mode transition to handle this exception)
Disk reads don't always succeed. You need to be able to handle memory exceptions from the system, otherwise a disk read failure will be an application crash.
So both ways will use the same Disk Cache, and will use the same mechanism of getting data into the cache (Page Fault exception -> Cache Manager reads data from the disk). The Cache Manager is also responsible for doing data prefetching and such, so it can read more than one page at a time. You don't get a page fault on every memory page.
So the main advantages of Memory-Mapped files are:
Can possibly use data in-place without copying it out first
Fewer User<->Kernel Mode transitions (depends on access patterns)
And the disadvantages are:
Need to handle access violation exceptions for failed disk reads (a sketch follows this list)
Takes up address space in the program to map entire files
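A hedged sketch of what handling a failed in-page read might look like with MSVC structured exception handling (CopyFromView is an invented helper name; real code would also report the error):

    #include <windows.h>
    #include <string.h>

    // Returns TRUE if the mapped range could be read, FALSE if the
    // underlying disk read failed (EXCEPTION_IN_PAGE_ERROR).
    BOOL CopyFromView(void* dst, const void* viewPtr, SIZE_T bytes)
    {
        __try {
            memcpy(dst, viewPtr, bytes);   // faults the pages in as it touches them
            return TRUE;
        }
        __except (GetExceptionCode() == EXCEPTION_IN_PAGE_ERROR
                      ? EXCEPTION_EXECUTE_HANDLER
                      : EXCEPTION_CONTINUE_SEARCH) {
            return FALSE;                  // disk read failed; report it, don't crash
        }
    }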