I am writing a program that needs to traverse a large 40 GB binary file, but I only have 16 GB of physical RAM. A friend told me that I can use file mapping to alleviate this problem. I understand how to create a file mapping and read through a file mapping handle, and how file mapping maps parts of a file on persistent storage to different chunks of virtual memory for reading.
So if I am understanding this correctly, I can create a buffer of, say, 10 GB and read the first 10 GB of the file into this buffer. But when I have to read past the 10 GB mark in the file, will the OS fetch another block automatically for me, or do I have to do so manually in my code?
The functions you linked to aren't (directly) related to file mapping. They're used for regular file I/O.
To use traditional file I/O with a really large file, you could do as you described. You would open the file, create a buffer, and read a chunk of the file into the buffer. When you need to access a different part of the file, you read a different chunk into the buffer.
To use a file mapping, you use CreateFile, CreateFileMapping, and then MapViewOfFile. You don't (directly) create a buffer and read a part of the file into it. Instead, you tell the system that you want to access a range of the file as though it were a range of your address space. Reads and writes to those addresses are turned into file I/O operations behind the scenes. In this approach you might still have to work in chunks. If the part of the file you need to access is not in the range you currently have mapped, you can create another view (and possibly close the old one).
But note that I said address space, which is different from RAM. If you're building for 64-bit Windows, you can try to map the entire 40 GB file into your address space. The fact that you have only 16 GB of RAM won't stop you. There may be some other problems at that size, but they won't be because of your RAM. If there are other problems, you'll be back to managing the file in chunks as before.
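For illustration, a minimal read-only mapping of the whole file might look roughly like this (64-bit build, error handling omitted, and the file name is just a placeholder):

    #include <windows.h>

    int main()
    {
        HANDLE file = CreateFileW(L"data.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);

        // A maximum size of 0 means "use the current size of the file".
        HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);

        // Map the entire file into the address space; pages are faulted in on demand.
        const unsigned char* base = static_cast<const unsigned char*>(
            MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

        // Touching any offset within the 40 GB range triggers the paging I/O for you.
        unsigned char firstByte = base[0];

        UnmapViewOfFile(base);
        CloseHandle(mapping);
        CloseHandle(file);
        return firstByte == 0 ? 0 : 1;
    }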
I'm attempting to figure out the best way to write files in Windows. For that, I've been running some tests with memory mapping to understand what is happening and how I should organize things...
Scenario: The file is intended to be used in a single process, in multiple threads. You should see a thread as a worker that works on the file storage; some of them will read, some will write - and in some cases the file will grow. I want my state to survive both process and OS crashes. Files can be large, say: 1 TB.
After reading a lot on MSDN, I whipped up a small test case. What I basically do is the following:
1. Open the file (CreateFile) using FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH.
2. Build a file mapping handle (CreateFileMapping) on the file, using some file growth mechanism.
3. Map memory regions (MapViewOfFile) using a multiple of the sector size (from STORAGE_PROPERTY_QUERY). The mode I intend to use is READ+WRITE.
So far I've been unable to figure out how to use these constructs exactly (tools like diskmon won't work for good reasons), so I decided to ask here. What I basically want to know is: how can I best use these constructs for my scenario?
If I understand correctly, this is more or less the correct approach; however, I'm unsure as to the exact role of CreateFileMapping vs MapViewOfFile and if this will work in multiple threads (e.g. the way writes are ordered when they are flushed to disk).
I intend to open the file once per process as per (1).
Per thread, I intend to create a file mapping handle as per (2) for the entire file. If I need to grow the file, I will estimate how much space I need, close the handle and reopen it using CreateFileMapping.
While the worker is doing its thing, it needs pieces of the file. So, I intend to use MapViewOfFile (which seems limited to 2 GB) for each piece, process it and unmap it again.
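In code, my test case boils down to roughly the following (the file name, sizes and offset are just placeholder values from the experiment):

    #include <windows.h>
    #include <cstring>

    int main()
    {
        // (1) Open the file unbuffered / write-through.
        HANDLE file = CreateFileW(L"storage.bin", GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                                  OPEN_ALWAYS,
                                  FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, nullptr);

        // (2) Create the mapping object; asking for 1 GB here also grows the file to 1 GB.
        ULONGLONG size = 1ull << 30;
        HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE,
                                            static_cast<DWORD>(size >> 32),
                                            static_cast<DWORD>(size & 0xFFFFFFFFull), nullptr);

        // (3) Map one piece; the offset must be a multiple of the allocation granularity.
        ULONGLONG offset = 0;
        SIZE_T pieceSize = 1 << 20;                                  // 1 MB piece
        char* view = static_cast<char*>(MapViewOfFile(mapping, FILE_MAP_READ | FILE_MAP_WRITE,
                                                      static_cast<DWORD>(offset >> 32),
                                                      static_cast<DWORD>(offset & 0xFFFFFFFFull),
                                                      pieceSize));
        std::memset(view, 0xAB, pieceSize);                          // the worker reads/writes here
        UnmapViewOfFile(view);

        CloseHandle(mapping);
        CloseHandle(file);
    }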
Questions:
Do I understand the concepts correctly?
When is data physically read and written to disk? So, when I have a loop that writes 1 MB of data in (3), will it write that data after the unmap call? Or will it write data the moment I hit memory in another page? (After all, disks are block devices so at some point we have to write a block...)
Will this work in multiple threads? This is about the calls themselves - I'm not sure whether they will fail if you have, say, 100 workers.
I do understand that (written) data is immediately available in other threads (unless it's a remote file), which means I should be careful with read/write concurrency. If I intend to write stuff, and afterwards update a (single-physical-block) header indicating that readers should use another pointer from now on - then is it guaranteed that the data is written before the header?
Will it matter if I use 1 file or multiple files (assuming they're on the same physical device of course)?
Memory mapped files generally work best for READING, not writing. The problem you face is that you have to know the size of the file before you do the mapping.
You say:
in some cases the file will grow
Which really rules out a memory mapped file.
When you create a memory mapped file on Windoze, you are creating your own page file and mapping a range of memory to that page file. This tends to be the fastest way to read binary data, especially if the file is contiguous.
For writing, memory mapped files are problematic.
I have a binary file of ~400 MB which I want to convert to CSV format. The output CSV file will be ~1 GB (according to my calculations).
I read the binary file and store it in an array of structures (required for other processing too), and when the user wants to export it to CSV, I am creating a file (or opening an existing file - depending on the user's choice), opening it using fopen and then writing to it using fwrite, line by line.
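The export loop is essentially the following (the real record layout is more complicated; this is a simplified placeholder):

    #include <cstdio>
    #include <cstddef>

    struct Record { int id; double value; };   // the real structure has more fields

    void exportCsv(const Record* records, std::size_t count, const char* path)
    {
        FILE* out = std::fopen(path, "w");     // or an existing file, per the user's choice
        if (!out) return;
        char line[64];
        for (std::size_t i = 0; i < count; ++i) {
            int len = std::snprintf(line, sizeof(line), "%d,%f\n",
                                    records[i].id, records[i].value);
            std::fwrite(line, 1, static_cast<std::size_t>(len), out);   // one line per record
        }
        std::fclose(out);
    }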
Coming to my question, this link from CPlusPlus.com says:
The returned stream is fully buffered by default if it is known to not refer to an interactive device
My query is: when I open this file, will it be loaded into RAM? That is, when my file has grown to ~1 GB at the end, will it consume that much RAM or will it just be on the hard disk?
This code will run on Windows as well as Android.
FILE* stream buffering is a C feature and it is used to reduce system call overhead (i.e. avoid calling read() for each fgetc(), which is expensive). Usually the buffer is small, e.g. 512 bytes.
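If you want, you can supply a bigger buffer yourself with setvbuf; a rough example (the 1 MB size and file name are arbitrary):

    #include <cstdio>
    #include <vector>

    int main()
    {
        FILE* f = std::fopen("output.csv", "w");           // placeholder file name
        if (!f) return 1;
        std::vector<char> buf(1 << 20);                    // 1 MB user-supplied buffer
        std::setvbuf(f, buf.data(), _IOFBF, buf.size());   // must be called before any I/O on f
        std::fputs("id,value\n", f);
        std::fclose(f);                                    // flushes the buffer; buf must outlive the stream
    }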
The page cache or similar mechanisms are different beasts -- they are used to reduce the number of disk operations. Usually the operating system uses free memory to cache data previously read from or written to disk, so subsequent operations will use RAM.
If there is a shortage of free memory, data is evicted from the page cache.
It is operating-system, file-system and computer specific, and it might not matter that much. Read about the page cache.
BTW, you might be interested in SQLite.
From an application writer point of view, you should care more about virtual memory and address space of your process than about RAM. Physical RAM is managed by the operating system.
On Linux and Android, if you want to optimize that you might consider (later) using posix_fadvise(2) and perhaps madvise(2). I'm not sure it is worth the pain in your case (since a gigabyte file is not that much today).
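A typical hint might look like this (a sketch only; the file name is a placeholder, and you should measure before bothering):

    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        int fd = open("input.bin", O_RDONLY);      // placeholder file name
        if (fd < 0) return 1;
        // Hint that the whole file will be read sequentially, so the kernel
        // can read ahead aggressively and drop pages behind us.
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        /* ... read the file ... */
        close(fd);
    }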
I read the binary file and store it in an array of structures (required for other processing too), and when the user wants to export it to CSV
Reading per se doesn't use a lot of memory; as myaut says, the buffer is small. The elephant in the room here is: do you read the whole file and put all the data into structures, or do you start processing after one or a few reads, fetching only the minimum amount of data needed for each processing step? Doing the former will indeed use ~400 MB or more of memory; doing the latter will use quite a lot less. That said, it all depends on how much data is needed to start processing, and maybe you really do need all of it loaded at once.
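A rough sketch of the latter, chunked approach (the record type, chunk size and processing step are placeholders):

    #include <cstdio>
    #include <cstddef>
    #include <vector>

    struct Record { int id; double value; };   // placeholder for the real structure

    static double g_total = 0.0;

    // Stand-in for whatever processing is actually required per chunk.
    static void process(const Record* recs, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            g_total += recs[i].value;
    }

    void convert(const char* path)
    {
        FILE* in = std::fopen(path, "rb");
        if (!in) return;
        std::vector<Record> chunk(4096);       // ~64 KB per read instead of ~400 MB at once
        std::size_t got;
        while ((got = std::fread(chunk.data(), sizeof(Record), chunk.size(), in)) > 0)
            process(chunk.data(), got);
        std::fclose(in);
    }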
I am trying to use a Memory Mapped File (MMF) to read my .csv data file (very large and time-consuming to read).
I've heard that MMF is very fast since it caches the content of the file, so users can access the content on disk as if it were in memory.
May I know if MMF is any faster than using other reading methods?
If this is true, can anyone show me a simple example how to read a file from disk?
Many thanks in advance.
May I know if MMF is any faster than using other reading methods?
If you're reading the entire file sequentially in one pass, then a memory-mapped file is probably approximately the same as using conventional file I/O.
can anyone show me a simple example how to read a file from disk?
Memory mapped files are typically an operating system feature, so you'd have to tell us which platform you're on to get an example of using it.
If you want to read a file sequentially, you can use the C++ ifstream class or the C run-time functions like fopen, fread, and fclose.
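For example, a plain sequential read might look like this (the file name is a placeholder):

    #include <fstream>
    #include <string>

    int main()
    {
        std::ifstream in("data.csv");          // placeholder file name
        std::string line;
        while (std::getline(in, line)) {
            // parse one CSV line here
        }
    }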
Whether or not it's faster depends on many different factors (such as what data you are accessing, how you are accessing it, etc.). To determine what is right for YOUR case, you need to benchmark different solutions and see what works best for you.
The main benefit of memory mapped files is that the data can be copied directly from the filesystem into the user-accessible memory.
In traditional (fstream::read(), fread(), etc.) types of file reading, the content of the file is read into a temporary buffer in the OS, then (part of) that buffer is copied to the user-supplied buffer. This is because the OS can't rely on the memory being there and it gets pretty messy pretty quickly. For memory mapped files, the OS knows directly where the memory for the different sections of the file is (because it's the OS's task to assign that memory and keep track of where it is!), so it can just copy the data straight in.
However, I strongly suspect that the method of reading the file is a minor part, and the actual interpretation/parsing/copying out of the file may well be the larger part. [Speculation, we haven't seen your code, of course.] And of course, the I/O speed available from the DISK itself may be a large factor if the file is very large.
I am using memory mapped files to have read-/write-access to a large number of image files (~10000 x 16 MB) under Windows 7 64bit. My goals are:
Having as much data cached as possible.
Being able to allocate new images and write to those as fast as possible.
Therefore I am using memory mapped files to access the files. Caching works well, but the OS is not flushing dirty pages until I am nearly out of physical memory. Because of that allocating and writing to new files is quite slow once the physical memory is filled.
One solution would be to regularly use FlushViewOfFile(), but this function does not return until the data has been written to disk.
Is there a way to asynchronously flush a file mapping? The only solution I found is to UnmapViewOfFile() and MapViewOfFile() again, but using this approach I cannot be sure to get the same data pointer again. Can someone suggest a better approach?
Edit:
After reading the WinAPI documentation a little longer, it seems I have found a suitable solution to my problem:
Calling VirtualUnlock() on a memory range that is not locked results in a flushing of dirty pages.
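In code, the trick boils down to something like this (view and viewSize come from the earlier MapViewOfFile call):

    #include <windows.h>

    // view and viewSize describe a range obtained from MapViewOfFile.
    void lazyFlush(void* view, SIZE_T viewSize)
    {
        // The range is not locked, so VirtualUnlock fails with ERROR_NOT_LOCKED,
        // but as a side effect it removes the pages from the working set, and the
        // OS can then write the dirty pages out in the background.
        VirtualUnlock(view, viewSize);
    }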
I heard that the FlushViewOfFile() function does NOT wait until the data is physically written to the file.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366563(v=vs.85).aspx
The FlushViewOfFile function does not flush the file metadata, and it does not wait to return until the changes are flushed from the underlying hardware disk cache and physically written to disk.
After call "FlushFileBuffers( ... )" then your data will be physically written to disk.
I want to map a binary file into memory; however, I am not sure how to create the file to be mapped into the system. I have read the documentation several times and realize there are two mapped-file implementations in Boost: one in Boost.Iostreams and the other in Boost.Interprocess.
Do you have any idea how to create a mapped file in shared memory? I am trying to allow a multi-threaded program to read a large array of doubles written in a binary file format. Also, what is the difference between the mapped file in Boost.Iostreams and the one in Boost.Interprocess?
As far as I can tell, Boost.Iostreams will place the mapped file in shared memory (this is what you want); Boost.Interprocess, however, places the file in another process's address space.
You should probably use iostreams unless you have multiple processes (not threads) that will be communicating with each other in some way.
The chief difference I see between the two is how they are used. In boost-interprocess, to use a memory mapped file, you create objects in that memory space using placement new, and those objects are automatically persistent in binary form in your file. Other processes can then map the same file and use those objects, or the program itself can use it as a persistent cache and reload them later. Memory mapped files in boost-iostreams act just like file streams, with the added benefits of being a boost::iostream, and would provide stream semantics to interprocess communication.
For a single process, there isn't much benefit to using boost::iostream memory mapped files. However, they can reduce the latency of working with the file, as it will already have been loaded into memory. But you only gain this benefit if you are repeatedly rewriting parts of the file; for a single read/write pass of the file, there may not be any speed-up.
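For the single-process, read-only case in the question (a big array of doubles), a rough Boost.Iostreams sketch might look like this (the file name is a placeholder):

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <cstddef>
    #include <iostream>

    int main()
    {
        boost::iostreams::mapped_file_source src("doubles.bin");   // read-only mapping
        const double* data  = reinterpret_cast<const double*>(src.data());
        std::size_t   count = src.size() / sizeof(double);
        // Multiple threads within this process can read data[0..count) concurrently.
        if (count > 0)
            std::cout << "first value: " << data[0] << "\n";
    }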