Read with File Mapping Objects in C++

I am trying to use a memory-mapped file (MMF) to read my .csv data file, which is very large and time-consuming to read.
I've heard that MMF is very fast since it caches the content of the file, so the data on disk can be accessed as if it were already in memory.
May I know if MMF is any faster than using other reading methods?
If this is true, can anyone show me a simple example of how to read a file from disk?
Many thanks in advance.

May I know if MMF is any faster than using other reading methods?
If you're reading the entire file sequentially in one pass, then a memory-mapped file will probably perform about the same as conventional file I/O.
can anyone show me a simple example how to read a file from disk?
Memory mapped files are typically an operating system feature, so you'd have to tell us which platform you're on to get an example of using it.
If you want to read a file sequentially, you can use the C++ ifstream class or the C run-time functions like fopen, fread, and fclose.
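For plain sequential reading, a minimal sketch with std::ifstream could look like this (the file name "data.csv" and the per-line processing are placeholders):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        std::ifstream in("data.csv");   // placeholder file name
        if (!in) {
            std::cerr << "could not open file\n";
            return 1;
        }

        std::string line;
        while (std::getline(in, line)) {
            // parse/process one CSV line here
        }
        return 0;
    }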

Whether it's faster or not depends on many different factors (such as what data you are accessing and how you are accessing it). To determine what is right for YOUR case, you need to benchmark the different solutions and see which is best for you.
The main benefit of memory mapped files is that the data can be copied directly from the filesystem into the user-accessible memory.
In traditional file reading (fstream::read(), fread(), etc.), the content of the file is read into a temporary buffer in the OS, and then (part of) that buffer is copied to the user-supplied buffer. This is because the OS can't rely on the user's memory being there, and it gets pretty messy pretty quickly. For memory-mapped files, the OS knows directly where the memory for the different sections of the file is (because it's the OS's task to assign that memory and keep track of where it is!), so it can copy the data straight in.
However, I strongly suspect that the method of reading the file is a minor part, and the actual interpretation/parsing/copying out of the file may well be a larger part. [Speculation, we haven't seen your code, of course.] And of course, the I/O speed available from the DISK itself may be a large factor if the file is very large.
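To make the above concrete on Windows (which the title's "File Mapping Objects" suggests), a minimal read-only mapping sketch could look roughly like this; the file name is a placeholder and error handling is reduced to early returns:

    #include <windows.h>
    #include <cstdio>

    int main()
    {
        // Open the file for reading.
        HANDLE file = CreateFileA("data.csv", GENERIC_READ, FILE_SHARE_READ,
                                  nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        // Create a read-only file mapping object covering the whole file.
        HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        if (!mapping) { CloseHandle(file); return 1; }

        // Map the whole file into the process address space.
        const char* data = static_cast<const char*>(
            MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
        if (!data) { CloseHandle(mapping); CloseHandle(file); return 1; }

        LARGE_INTEGER size;
        GetFileSizeEx(file, &size);
        // data[0] .. data[size.QuadPart - 1] can now be read like ordinary memory.
        std::printf("first byte: %c\n", size.QuadPart > 0 ? data[0] : '?');

        UnmapViewOfFile(data);
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }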

Related

Memory mapped IO concept details

I'm attempting to figure out what the best way is to write files in Windows. For that, I've been running some tests with memory mapping, in an attempt to figure out what is happening and how I should organize things...
Scenario: The file is intended to be used in a single process, in multiple threads. You should see a thread as a worker that works on the file storage; some of them will read, some will write - and in some cases the file will grow. I want my state to survive both process and OS crashes. Files can be large, say: 1 TB.
After reading a lot on MSDN, I whipped up a small test case. What I basically do is the following:
(1) Open a file (CreateFile) using FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH.
(2) Build a mmap file handle (CreateFileMapping) on the file, using some file growth mechanism.
(3) Map the memory regions (MapViewOfFile) using a multiple of the sector size (from STORAGE_PROPERTY_QUERY). The mode I intend to use is READ+WRITE.
So far I've been unable to figure out how to use these constructs exactly (tools like diskmon won't work, for good reasons), so I decided to ask here. What I basically want to know is: how can I best use these constructs for my scenario?
If I understand correctly, this is more or less the correct approach; however, I'm unsure as to the exact role of CreateFileMapping vs MapViewOfFile and if this will work in multiple threads (e.g. the way writes are ordered when they are flushed to disk).
I intend to open the file once per process as per (1).
Per thread, I intend to create a mmap file handle as per (2) for the entire file. If I need to grow the file, I will estimate how much space I need, close the handle and reopen it using CreateFileMapping.
While the worker is doing its thing, it needs pieces of the file. So, I intend to use MapViewOfFile (which seems limited to 2 GB) for each piece, process it, and unmap it again.
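For reference, a hedged sketch of steps (1)-(3) as described above; all sizes, offsets, and the file name are placeholders, and note that view offsets must be multiples of the system allocation granularity (usually 64 KB), not just the sector size:

    #include <windows.h>

    int main()
    {
        // (1) Open the file with the flags from the question.
        HANDLE file = CreateFileA("storage.bin", GENERIC_READ | GENERIC_WRITE,
                                  0, nullptr, OPEN_ALWAYS,
                                  FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        // (2) Create a read/write mapping object; a maximum size larger than the
        //     current file size grows the file to that size.
        const ULONGLONG mappingSize = 1ull << 30;   // 1 GB, placeholder
        HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READWRITE,
                                            static_cast<DWORD>(mappingSize >> 32),
                                            static_cast<DWORD>(mappingSize & 0xFFFFFFFF),
                                            nullptr);
        if (!mapping) { CloseHandle(file); return 1; }

        // (3) Map one piece of the file; the offset must be a multiple of the
        //     allocation granularity reported by GetSystemInfo.
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        const ULONGLONG offset = 16ull * si.dwAllocationGranularity;  // placeholder piece
        const SIZE_T viewSize = 1 << 20;                              // 1 MB view, placeholder
        void* view = MapViewOfFile(mapping, FILE_MAP_READ | FILE_MAP_WRITE,
                                   static_cast<DWORD>(offset >> 32),
                                   static_cast<DWORD>(offset & 0xFFFFFFFF),
                                   viewSize);
        if (view) {
            // ... read and write through 'view' here ...
            UnmapViewOfFile(view);
        }

        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }
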
Questions:
Do I understand the concepts correctly?
When is data physically read and written to disk? So, when I have a loop that writes 1 MB of data in (3), will it write that data after the unmap call? Or will it write data the moment I hit memory in another page? (After all, disks are block devices so at some point we have to write a block...)
Will this work in multiple threads? This is about the calls themselves - I'm not sure whether they will fail if you have, say, 100 workers.
I do understand that (written) data is immediately available in other threads (unless it's a remote file), which means I should be careful with read/write concurrency. If I intend to write stuff, and afterwards update a (single-physical-block) header (indicating that readers should use another pointer from now on) - then is it guaranteed that the data is written prior to the header?
Will it matter if I use 1 file or multiple files (assuming they're on the same physical device of course)?
Memory mapped files generally work best for READING, not writing. The problem you face is that you have to know the size of the file before you do the mapping.
You say:
in some cases the file will grow
Which really rules out a memory mapped file.
When you create a memory mapped file on Windoze, you are creating your own page file and mapping a range of memory to that page file. This tends to be the fastest way to read binary data, especially if the file is contiguous.
For writing, memory mapped files are problematic.

RAM consumption on opening a file

I have a Binary file of ~400MB which I want to convert to CSV format. The output CSV file will be ~1GB (according to my calculations).
I read the binary file and store it in an array of structures (required for other processing too), and when the user wants to export it to CSV, I am creating a file (or opening an existing file - depending on the user's choice), opening it using fopen and then writing to it using fwrite, line by line.
Coming to my question, this link from CPlusPlus.com says:
The returned stream is fully buffered by default if it is known to not
refer to an interactive device
My question is: when I open this file, will it be loaded into RAM? That is, when my file ends up at ~1 GB, will it consume that much RAM, or will it just stay on the hard disk?
This code will run on Windows as well as Android.
FILE* stream buffering is a C feature used to reduce system call overhead (i.e. not calling read() for each fgetc(), which would be expensive). The buffer is usually small - e.g. 512 bytes.
The page cache and similar mechanisms are different beasts - they are used to reduce the number of disk operations. Usually the operating system uses free memory to cache data previously read from or written to disk, so subsequent operations will use RAM.
If there is a shortage of free memory, data is evicted from the page cache.
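If the default stdio buffer turns out to be too small for your write pattern, you can enlarge it with setvbuf before the first write. A hedged sketch (the file name, buffer size, and CSV rows are arbitrary placeholders):

    #include <cstdio>
    #include <vector>

    int main()
    {
        std::FILE* out = std::fopen("output.csv", "w");   // placeholder name
        if (!out) return 1;

        // Give the stream a 1 MB buffer instead of the small default, so that
        // fprintf/fwrite calls translate into fewer write() system calls.
        std::vector<char> buffer(1 << 20);
        std::setvbuf(out, buffer.data(), _IOFBF, buffer.size());

        for (int i = 0; i < 1000; ++i)
            std::fprintf(out, "%d,%d\n", i, i * i);   // placeholder CSV rows

        std::fclose(out);   // flushes the buffer; call it before 'buffer' goes away
        return 0;
    }
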
It is operating system, file system, and computer specific, and it might not matter that much. Read about the page cache.
BTW, you might be interested in SQLite.
From an application writer point of view, you should care more about virtual memory and address space of your process than about RAM. Physical RAM is managed by the operating system.
On Linux and Android, if you want to optimize that you might consider (later) using posix_fadvise(2) and perhaps madvise(2). I'm not sure it is worth the pain in your case (since a gigabyte file is not that much today).
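A minimal sketch of the posix_fadvise hint on Linux/Android follows; the file name and buffer size are placeholders, and the call is only a hint, so it may or may not make a measurable difference:

    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        int fd = open("input.bin", O_RDONLY);   // placeholder name
        if (fd < 0) return 1;

        // Tell the kernel we intend to read the whole file sequentially,
        // so it can read ahead more aggressively.
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[64 * 1024];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            // ... convert the records in buf to CSV here ...
        }

        close(fd);
        return 0;
    }
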
I read the binary file and store it in an array of structures (required for other processing too), and when the user wants to export it to CSV
Reading per se doesn't use a lot of memory; as myaut says, the buffer is small. The elephant in the room here is: do you read the whole file and put all the data into structures, or do you start processing after one or a few reads, with just the minimum amount of data needed? Doing the former will indeed use ~400 MB or more of memory; doing the latter will use quite a lot less. That said, it all depends on how much data is needed to start processing, and maybe you do need all the data loaded at once.
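If partial processing is possible, a hedged sketch of the streaming approach might look like the following; the Record struct, file names, and batch size are hypothetical placeholders:

    #include <cstdio>
    #include <cstdint>

    // Hypothetical fixed-size record; the real layout depends on the binary format.
    struct Record {
        std::int32_t id;
        double       value;
    };

    int main()
    {
        std::FILE* in  = std::fopen("input.bin", "rb");    // placeholder names
        std::FILE* out = std::fopen("output.csv", "w");
        if (!in || !out) return 1;

        // Read a small batch of records at a time instead of the whole file,
        // so memory use stays at the size of this buffer rather than ~400 MB.
        Record batch[1024];
        std::size_t n;
        while ((n = std::fread(batch, sizeof(Record), 1024, in)) > 0) {
            for (std::size_t i = 0; i < n; ++i)
                std::fprintf(out, "%d,%f\n", static_cast<int>(batch[i].id), batch[i].value);
        }

        std::fclose(out);
        std::fclose(in);
        return 0;
    }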

Windows C++ Lock file in memory

I need to read from a file very often, and I load the file into a vector of unsigned char using fread. The subsequent freads are really fast, even if the vector of unsigned char is destroyed right after reading.
It seems to me that something (Windows or the disk) caches the file and thus freads are very fast. I have not read anything about this behaviour, so I am unsure what really causes this.
If I don't use my application for 1 hour or so and then do an fread again, the fread is slow.
It seems to me that the cache got emptied.
Can somebody explain this behaviour to me? I would like to actively use it.
It is a problem for me when the freads are slow.
Memory-mapping the file works theoretically, but the file itself is too big, so I can not use it.
90/10 law
90% of the execution time of a computer program is spent executing 10% of the code
It is not a strict rule, but it usually holds, so lots of programs try to keep recently used data around, because it is very likely that the same data will be accessed again soon.
Windows is no exception: after receiving a command to read a file, the OS keeps some data about that file. It stores in memory the addresses of the pages where the data is held and, if possible, keeps some part (or even all) of the file's binary data in memory, which makes the next read much faster if it happens soon after the first one.
All in all, you are right that there is caching, but I can't say exactly what is going on, as I don't work at Microsoft...
Also, answering the next part of the question: mapping the file into memory may be a solution, but if the file is very large the machine may not have that much memory, so it wouldn't be an option. However, you can use the 90/10 law: map just the part of the file that is most important into memory, and while reading, build a table of the overall parameters.
I don't know your exact situation, but it may help.

Reading and writing binary files in Qt

I'm going to be working with binary files in a Qt project, and, being a little new to Qt, am not sure whether or not I should use a QVector<quint8> or a QByteArray to store the data. The files could be very small (< 1MiB) or very large (> 4GiB). The size is unknown until runtime.
I need to be able to have random seek and be able to process operations on every byte in the file. Would a memory mapped file be of any use here?
Thank you for any suggestions.
Loading whole large files into memory, be it a QVector or a QByteArray, is probably not a good solution.
Assuming the files have some kind of structure, you should use QFile::seek to position yourself at the start of a "record" and use qint64 QIODevice::read ( char * data, qint64 maxSize ) to read one record at a time in a buffer of your choice.
QIODevice::write has an overload for QByteArray if that affects your decision. QDataStream may be worth looking at for larger data. At the end of the day it's really up to you as various containers would work.
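A minimal sketch of that record-at-a-time approach (the file name, record size, and record index are placeholders):

    #include <QFile>
    #include <QByteArray>

    int main()
    {
        QFile file("data.bin");               // placeholder file name
        if (!file.open(QIODevice::ReadOnly))
            return 1;

        const qint64 recordSize  = 256;       // placeholder record size
        const qint64 recordIndex = 42;        // whichever record you need

        // Random access: seek to the start of one record and read just that record.
        if (!file.seek(recordIndex * recordSize))
            return 1;

        QByteArray record = file.read(recordSize);
        // ... process the bytes in 'record' ...

        return 0;
    }
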
Edit:
I think basic file I/O using whatever buffer you prefer is probably all you need. Use objects like QFile, QDataStream, QByteArray, etc. You could read and process only parts of the file with circular buffers to save memory especially if dealing with audio, video, or something that lends itself to streams. If there is a known structure to the file like XML, CSV, etc that also makes partial reads and processing easier as you can go line by line or tag by tag.
Memory mapped files use virtual memory to achieve faster I/O by mapping a file on disk into a virtual memory segment, which the application can then use as if it were ordinary process memory. Being able to treat a file as process memory allows you to do in-place editing, which is faster than having to seek to a position from the beginning of a file and faster than making OS-dependent API calls and dealing with hard disk reads/writes for every access. There does tend to be a fair amount of overhead to memory mapped files, and there are some possible limitations depending on how paging is implemented on your target platform or what architecture you're using. In Qt you would have to design your own objects to use memory mapped files, and historically I believe Linux systems support this functionality better than Windows.
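That said, Qt does provide QFile::map() (since Qt 4.4), which covers the simple cases without platform-specific code. A minimal hedged sketch, with the file name as a placeholder:

    #include <QFile>

    int main()
    {
        QFile file("data.bin");               // placeholder file name
        if (!file.open(QIODevice::ReadOnly))
            return 1;

        // Map the whole file into the process address space (read-only here).
        uchar* data = file.map(0, file.size());
        if (!data)
            return 1;

        // data[0] .. data[file.size() - 1] can now be accessed like an array.

        file.unmap(data);
        return 0;
    }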

Is there a difference between boost iostream mapped file and boost interprocess mapped file?

I want to create a binary file mapped into memory; however, I am not sure how to create the file to be mapped into the system. I read the documentation several times and realize there are two mapped file implementations, one in iostreams and the other in interprocess.
Do you guys have any idea how to create a mapped file in shared memory? I am trying to allow a multi-threaded program to read a large array of doubles written in a binary file format. Also, what is the difference between the mapped file in iostreams and the one in interprocess?
As far as I can tell, iostreams will place the mapped file in shared memory (which is what you want), whereas interprocess places the file in another process's address space.
You should probably use iostreams unless you have multiple processes (not threads) that will be communicating with each other in some way.
The chief difference I see between the two is how they are used. In boost-interprocess, to use a memory mapped file, you create objects in that memory space using placement new, and those objects are automatically persistent in binary form in your file. Other processes can then map the same file and use those objects, or the program itself can use it as a persistent cache and reload them later. Memory mapped files in boost-iostreams act just like file streams, with the added benefits of being a boost::iostream, and would provide stream semantics to interprocess communication.
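To illustrate the boost::interprocess style described above, here is a hedged sketch using a managed mapped file; the file name, segment size, and object name are placeholders:

    #include <boost/interprocess/managed_mapped_file.hpp>

    namespace bip = boost::interprocess;

    int main()
    {
        // Create (or open) a file-backed managed segment.
        bip::managed_mapped_file segment(bip::open_or_create, "doubles.bin", 1 << 20);

        // Construct (or look up) a named array of doubles directly inside the mapped
        // file; another process (or a later run) can find it again by the same name.
        double* values = segment.find_or_construct<double>("values")[1000](0.0);
        values[0] = 3.14;

        // Elsewhere: auto found = segment.find<double>("values");
        //            found.first is the pointer, found.second the element count.
        return 0;
    }
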
For a single process, there isn't much benefit to using boost::iostreams memory mapped files. However, they can reduce the latency of working with the file, as it will already have been loaded into memory - but you only gain this benefit if you are constantly rewriting parts of the file. For a single read/write pass over the file, there may not be any speedup.
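And for the boost::iostreams side, a hedged read-only sketch matching the use case of a binary file of doubles read by multiple threads in one process; the file name is a placeholder:

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <cstddef>
    #include <iostream>

    int main()
    {
        // Read-only mapping of the whole file.
        boost::iostreams::mapped_file_source file("doubles.bin");
        if (!file.is_open())
            return 1;

        const double* values = reinterpret_cast<const double*>(file.data());
        const std::size_t count = file.size() / sizeof(double);

        // Multiple threads in the same process can read 'values' concurrently.
        if (count > 0)
            std::cout << "first value: " << values[0] << '\n';

        return 0;
    }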