Reading and writing binary files in Qt - c++

I'm going to be working with binary files in a Qt project, and, being a little new to Qt, am not sure whether or not I should use a QVector<quint8> or a QByteArray to store the data. The files could be very small (< 1MiB) or very large (> 4GiB). The size is unknown until runtime.
I need to be able to have random seek and be able to process operations on every byte in the file. Would a memory mapped file be of any use here?
Thank you for any suggestions.

Loading whole large files in memory, be it QVector or QByteArray is probably not a good solution.
Assuming the files have some kind of structure, you should use QFile::seek to position yourself at the start of a "record" and use qint64 QIODevice::read ( char * data, qint64 maxSize ) to read one record at a time in a buffer of your choice.

QIODevice::write has an overload for QByteArray if that affects your decision. QDataStream may be worth looking at for larger data. At the end of the day it's really up to you as various containers would work.
Edit:
I think basic file I/O using whatever buffer you prefer is probably all you need. Use objects like QFile, QDataStream, QByteArray, etc. You could read and process only parts of the file with circular buffers to save memory especially if dealing with audio, video, or something that lends itself to streams. If there is a known structure to the file like XML, CSV, etc that also makes partial reads and processing easier as you can go line by line or tag by tag.
Memory mapped files use virtual memory to achieve faster I/O by basically making a copy of a file on disk in a virtual memory segment which is then capable of being used by the application as if it was just process memory. Being able to treat a file as process memory allows you to do in place editing which is faster than having to seek to a position from the beginning of a file and faster than making OS dependent API calls and dealing with hard disk read/writes. There does tend to be a fair amount of overhead to memory mapped files and there are some possible limitations depending on how paging is implemented in your target platform or what architecture you're using. In Qt you would have to design your own objects to use memory mapped files and historically I believe linux systems support this functionality better than windows.

Related

C++: will this disk seek take a very large performance hit?

I am using the STL fstream utilities to read from a file. However, what I would like to do is is read a specified number of bytes and then seek back some bytes and read again from that position. So, it is sort of an overlapped read. In code, this would look as follows:
ifstream fileStream;
fileStream.open("file.txt", ios::in);
size_t read_num = 0;
size_t windows_size = 200;
while (read_num < total_num)
{
char buffer[1024];
size_t num_bytes_read = fileStream.read(buffer, sizeof(buffer));
read_num += num_bytes_read - 200;
filestream.seekg(read_num);
}
This is the not the only way to solve my problem but will make multi-tasking a breeze (I have been looking at other data structures like circular buffers but that will make multitasking difficult). I was wondering if I can have your input on how much of a performance hit these seek operations might take when processing very large files. I will only ever use one thread to read the data from file.
The files contain large sequence of texts only characters from the set {A,D,C,G,F,T}. Would it also be advisable to open it as a binary file rather than in text mode as I am doing?
Because the file is large, I am also opening it in chucks with the chuck being set to a 32 MB block. Would this be too large to take advantage of any caching mechanism?
On POSIX systems (notably Linux, and probably MacOSX), the C++ streams are based on lower primitives (often, system calls) such as read(2) and write(2) and the implementation will buffer the data (in the standard C++ library, which would call read(2) on buffers of several kilobytes) and the kernel generally keeps recently accessed pages in its page cache. Hence, practically speaking, most not too big files (e.g. files of few hundred megabytes on a laptop with several gigabytes of RAM) are staying in RAM (once they have been read or written) for a while. See also sync(2).
As commented by Hans Passant, reading in the middle a textual file could be errorprone (in particular, because an UTF8 character may span on several bytes) if not done very carefully.
Notice that for a C (fopen) or C++ point of view, textual files and binary files differ notably on how they handle end of lines.
If performance matters a lot for you, you could use directly low level systems calls like read(2) and write(2) and lseek(2) but then be careful to use wide enough buffers (typically of several kilobytes, e.g. 4Kbytes to 512Kbytes, or even several megabytes). Don't forget to use the returned read or written byte count (some IO operations can be partial, or fail, etc...). Avoid if possible (for performance reasons) to repeatedly read(2) only a dozen of bytes. You could instead memory-map the file (or a segment of it) using mmap(2) (before mmap-ing, use stat(2) to get metadata information, notably file size). And you could give advices to the kernel using posix_fadvise(2) or (for file mapped into virtual memory) madvise(2). Performance details are heavily system dependent (file system, hardware -SSD and hard-disks are different!, system load).
At last, you should consider using some higher-level library on binary files such as indexed files à la GDBM or the sqlite library, or consider using real databases such as PostGreSQL, MonogDB etc.
Apparently, your files contain genomics information. Probably you don't care about end-of-line processing and could open them as binary streams (or directly as low-level Unix file descriptors). Perhaps there already exist free software libraries to parse them. Otherwise, you might consider a two-pass approach: a first pass is reading sequentially the entire file and remembering (in C++ containers like std::map) the interesting parts and their offsets. A second pass would use direct access. You might even have some preprocessor converting your genomics file into SQLITE or GDBM files, and have your application work on these. You probably should avoid opening these files as text (but just as binary file) because end-of-line processing is useless to you.
On a 64 bits system, if you handle only a few files (not thousands of them at once) of several dozens of gigabytes, memory mapping (with mmap) them should make sense, then use madvise (but on a 32 bits system, you won't be able to mmap the entire file).
Plasibly, yes. Whenever you seek, the cached file data for that file is (likely to be) discarded, causing extra overhead of, at least, a system call to fetch the data again.
Assuming the file isn't enormous, it MAY be a better choice to read the entire file into memory (or, if you don't need portability, use a memory mapped file, at which point caching of the file content is trivial - again assuming the entire file fits in (virtual) memory).
However, all this is implementation dependent, so measuring performance of each method would be essential - it's only possible to KNOW these things for a given system by measuring, it's not something you can read about and get precise information on the internet (not even here on SO), because there are a whole bunch of factors that affect the behaviour.

RAM consumption on opening a file

I have a Binary file of ~400MB which I want to convert to CSV format. The output CSV file will be ~1GB (according to my calculations).
I read the binary file and store it in an array of structures (required for other processing too), and when the user wants to export it to CSV, I am creating a file (or opening an existing file - depending on the user's choice), opening it using fopen and then writing to it using fwrite, line by line.
Coming to my question, this link from CPlusPlus.com says:
The returned stream is fully buffered by default if it is known to not
refer to an interactive device
My query is when I open this file, will it be loaded in RAM? Like when at the end, my file is of ~1GB, will it consume that much RAM or will it be just on the hard disk?
This code will run on Windows as well as Android.
FILE* streams buffering is a C feature and it is used to reduce system call overhead (i.e. do not call read() for each fgetc() which is expensive). Usually buffer is small - i.e. 512 bytes.
Page Cache or similiar mechanisms are different beasts -- they are used to reduce number of disks operations. Usually operating system uses free memory to cache previously read or written data to/from disk so subsequent operations will use RAM.
If there are shortage of free memory -- data is evicted from page cache.
It is operating system and file system and computer specific. And it might not matter that much. Read about page cache.
BTW, you might be interested by sqlite
From an application writer point of view, you should care more about virtual memory and address space of your process than about RAM. Physical RAM is managed by the operating system.
On Linux and Android, if you want to optimize that you might consider (later) using posix_fadvise(2) and perhaps madvise(2). I'm not sure it is worth the pain in your case (since a gigabyte file is not that much today).
I read the binary file and store it in an array of structures (required for other processing too), and when the user wants to export it to CSV
Reading per se doesn't use a lot of memory, like myaut says the buffer is small. The elephant in the room here is: do you you read up all the file and put all the data into structures? or do you start processing after one or few reads to get the minimum amount of data needed to do some processing? Doing the former will indeed use ~400MB or more memory, doing the later will use quite a lot less, that being said, it all depends on the amount of data needed to start processing, and maybe you need all the data loaded at once.

Which type of file access to use?

I'm building an io framework for a custom file format for windows only. The files are anywhere between 100MB and 100GB. The reads/writes come in sequences of few hundred KB to a couple MB in unpredictable locations. Read speed is most critical, though, cpu use might trump that since I hear fstream can put a real dent in it when working with SSDs.
Originally I was planning to use fstream but as I red a bit more into file IO, I discovered a bunch of other options. Since I have little experience with the topic, I'm torn as to which method to use. The options I've scoped out are fstream, FILE and mapped file.
In my research so far, all I've found is a lot of contradictory benchmark results depending on chunk sizes, buffer sizes and other bottlenecks i don't understand. It would be helpful if somebody could clarify the advantages/drawbacks between those options.
In this case the bottleneck is mainly your hardware and not the library you use, since the blocks you are reading are relatively big 200KB - 5MB (compared to the sector size) and sequential (all in one).
With hard disks (high seek time) it could have a sense to read more data than needed to optimize the caching. With SSDs I would not use big buffers but read only the exact required data since the seek time is not a big issue.
Memory Mapped files are convenient for complete random access to your data, especially in small chunks (even few bytes). But it takes more code to setup a memory mapped file. On 64 bit systems you may could map the whole file (virtually) and then let the OS caching system optimize the reads (multiple access to the same data). You can then return just the pointer to the required memory location without even the need to have temporary buffers or use memcpy. It would be very simple.
fstream gives you extra features compared to FILE which I think aren't much of use in your case.

Read with File Mapping Objects in C++

I am trying to use Memory Mapped File (MMF) to read my .csv data file (very large and time consuming).
I've heared that MMF is very fast since it caches content of the file, thus users can get access to the content in disk as in memory.
May I know if MMF is any faster than using other reading methods?
If this is true, can anyone show me a simple example how to read a file from disk?
Many thanks in advance.
May I know if MMF is any faster than using other reading methods?
If you're reading the entire file sequentially in one pass, then a memory-mapped file is probably approximately the same as using conventional file I/O.
can anyone show me a simple example how to read a file from disk?
Memory mapped files are typically an operating system feature, so you'd have to tell us which platform you're on to get an example of using it.
If you want to read a file sequentially, you can use the C++ ifstream class or the C run-time functions like fopen, fread, and fclose.
If it's faster or not depends on many different factors (such as what data you are accessing, how you are accessing it, etc. To determine what is right for YOUR case, you need to benchmark different solutions, and see what is best in your case.
The main benefit of memory mapped files is that the data can be copied directly from the filesystem into the user-accessible memory.
In traditional (fstream::read(), fredad(), etc) types of file-reading, the content of the file is read into a temporary buffer in the OS, then (part of) that buffer is copied to the user supplied buffer. This is because the OS can't rely on the memory being there and it gets pretty messy pretty quickly. For memory mapped files, the OS knows directly where the memory is for the different sections (because it's the OS's task to assign that memory and keep track of where it is!) of the file, so the OS can just copy it straight in.
However, I strongly suspect that the method of reading the file is a minor part, and the actual interpretation/parsing/copying out of the file may well be a large part. [Speculation, we haven't seen your code, of course]. And of course, the I/O speed available from the DISK itself may play a large factor if the file is very large.

Write a large file to disk from RAM

If I need to write a large file from allocated memory to disk, what is the most efficient way to do it?
Currently I use something along the lines of:
char* data = static_cast<char*>(operator new(0xF00000000)); // 60 GB
// Do something to fill `data` with data
std::ofstream("output.raw", std::ios::binary).
write(data, 0xF00000000);
But I am not sure if the most straightforward way is also the most efficient, taking into account various buffering mechanisms and alike.
I am using Windows 7 64-bit and Visual Studio 2012 RC compiler with 64-bit target.
For Windows, you should use CreateFile API. Have a good read of that page and any links from it mentioning optimization. There are some flags you pass in to turn off buffering. I did this in the past when I was collecting video at about 800MB per second, and having to write off small parts of it as fast as possible to a RAID array.
Now, for the flags - I think it's primarily these:
FILE_FLAG_NO_BUFFERING
FILE_FLAG_WRITE_THROUGH
For reading, you may want to use FILE_FLAG_SEQUENTIAL_SCAN, although I think this has no effect if buffering is turned off.
Have a look at the Caching Behaviour section
There's a couple of things you need to do. Firstly, you should always write amounts of data that are a multiple of the sector size. This is (or at least was) 512 bytes almost universally, but you may want to consider up to 2048 in future.
Secondly, your memory has to be aligned to that sector size too. You can either use _aligned_malloc() or just allocate more buffer than you need and align manually.
There may be other memory optimization concerns, and you may want to limit individual write operations to a memory page size. I never went into that depth. I was still able to write data at speeds very close to the disk's limit. It was significantly faster than using stdio calls.
If you need to do this in the background, you can use overlapped I/O, but to be honest I never understood it. I made a background worker thread dedicated to writing out video buffer and controlled it externally.
The most promising thing that comes to mind is memory mapping the output file. Depending on how the data gets filled, you may even be able to have your existing program write directly to the disk via the pointer, and not need a separate write step at the end. That trusts the OS to efficiently page the file, which it may be having to do with the heap memory anyway... could potentially avoid a disk-to-disk copy.
I'm not sure how to do it in Windows specifically, but you can probably notify the OS of your intended memory access pattern to increase performance further.
(boost::asio has portable support for memory mapped files)
If you want to use std::ofstream you should make sure of the following:
No buffer is used by the file stream. The way to do this to call out.setbuf(0, 0).
Make sure that the std::locale used by stream doesn't do any character conversion, i.e., std::use_facet<std::codecvt<char, char> >(loc).always_noconv() yields true. The "C" locale does this.
With this, I would expect that std::ofstream is as fast as any other approach writing a large buffer. I would also expect it to be slower than using memory mapped I/O because memory mapped I/O should avoid paging sections of the memory when reading them just to write their content.
Open a file with CreateFile, use SetEndOfFile to preallocate the space for the file (to avoid too much fragmentation as you write), then call WriteFile with 2 MB sized buffers (this size works the best in most scenarios) in a loop until you write the entire file out.
FILE_FLAG_NO_BUFFERING may help in some situations and may make the situation worse in others, so no real need to use it, because normally Windows file system write cache is doing its work well.