I have a binary file of ~400MB which I want to convert to CSV format. The output CSV file will be ~1GB (according to my calculations).
I read the binary file and store it in an array of structures (required for other processing too). When the user wants to export it to CSV, I create a file (or open an existing one, depending on the user's choice) using fopen and then write to it using fwrite, line by line.
Coming to my question, this link from CPlusPlus.com says:
The returned stream is fully buffered by default if it is known to not
refer to an interactive device
My question is: when I open this file, will it be loaded into RAM? When my file reaches ~1GB at the end, will it consume that much RAM, or will it just be on the hard disk?
This code will run on Windows as well as Android.

FILE* stream buffering is a C library feature, used to reduce system call overhead (e.g. so that fgetc() does not call read() each time, which would be expensive). The buffer is usually small, e.g. 512 bytes or a few kilobytes.
The page cache and similar mechanisms are different beasts: they are used to reduce the number of disk operations. The operating system usually uses free memory to cache data previously read from or written to disk, so subsequent operations are served from RAM.
If free memory runs short, data is evicted from the page cache.
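To make the stdio buffer visible, here is a minimal sketch that exports records line by line through a stream whose buffer is set explicitly with setvbuf. The struct sample type, the export_csv name and the 64 KiB size are assumptions made for illustration, not taken from the question.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record type standing in for the asker's structure. */
struct sample { int id; double value; };

/* Write n records as CSV lines through a fully buffered stream.
   Returns 0 on success, -1 on failure. */
int export_csv(const char *path, const struct sample *rows, size_t n)
{
    /* The only memory stdio uses for this file. Being static, it outlives
       the stream, but it also makes the function non-reentrant. */
    static char buf[1 << 16];
    assert(path != NULL);

    FILE *out = fopen(path, "w");
    if (out == NULL)
        return -1;
    /* setvbuf must be called before any other operation on the stream. */
    if (setvbuf(out, buf, _IOFBF, sizeof buf) != 0) {
        fclose(out);
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        if (fprintf(out, "%d,%.3f\n", rows[i].id, rows[i].value) < 0) {
            fclose(out);
            return -1;
        }
    }
    /* fclose flushes what is left in the buffer and reports late write errors. */
    return fclose(out) == 0 ? 0 : -1;
}
```

However large the output grows, the process itself only holds this one buffer for the file; whatever else stays in RAM is page cache owned by the operating system.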

It is specific to the operating system, the file system and the computer. And it might not matter that much. Read about the page cache.
BTW, you might be interested in SQLite.
From an application writer's point of view, you should care more about the virtual memory and address space of your process than about RAM. Physical RAM is managed by the operating system.
On Linux and Android, if you want to optimize that you might consider (later) using posix_fadvise(2) and perhaps madvise(2). I'm not sure it is worth the pain in your case (since a gigabyte file is not that much today).

I read the binary file and store it in an array of structures (required for other processing too), and when the user wants to export it to CSV
Reading per se doesn't use a lot of memory; as myaut says, the buffer is small. The elephant in the room here is: do you read the whole file and put all the data into structures, or do you start processing after one or a few reads, once you have the minimum amount of data needed? Doing the former will indeed use ~400MB or more of memory; doing the latter will use a lot less. That being said, it all depends on the amount of data needed to start processing, and maybe you do need all the data loaded at once.
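If the export does not need all the data at once, the second approach could look roughly like the sketch below: read a fixed-size batch of records, write their CSV lines, repeat. The struct sample layout, the BATCH size and the convert_in_batches name are hypothetical, and the binary file is assumed to be a plain sequence of such records.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical on-disk record; the real layout is not given in the question. */
struct sample { int id; double value; };

enum { BATCH = 4096 };   /* records held in memory at any one time */

/* Convert a binary file of raw records to CSV, batch by batch.
   Returns the number of records converted, or -1 on error. */
long convert_in_batches(const char *bin_path, const char *csv_path)
{
    static struct sample batch[BATCH];
    long total = 0;
    size_t got;
    assert(bin_path != NULL && csv_path != NULL);

    FILE *in = fopen(bin_path, "rb");
    if (in == NULL)
        return -1;
    FILE *out = fopen(csv_path, "w");
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    /* fread returns how many whole records it delivered; 0 means EOF or error. */
    while ((got = fread(batch, sizeof batch[0], BATCH, in)) > 0) {
        for (size_t i = 0; i < got; i++)
            fprintf(out, "%d,%.3f\n", batch[i].id, batch[i].value);
        total += (long)got;
    }
    int failed = ferror(in) || ferror(out);
    fclose(in);
    if (fclose(out) != 0)
        failed = 1;
    return failed ? -1 : total;
}
```

Memory use is then bounded by BATCH records plus the two stdio buffers, regardless of file size.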


C++: will this disk seek take a very large performance hit?

I am using the STL fstream utilities to read from a file. However, what I would like to do is read a specified number of bytes, then seek back some bytes and read again from that position. So it is a sort of overlapped read. In code, this would look as follows:
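ifstream fileStream;
fileStream.open("file.txt", ios::in);
size_t read_num = 0;
size_t window_size = 200;   // bytes shared between consecutive reads
while (read_num < total_num)
{
    char buffer[1024];
    fileStream.read(buffer, sizeof(buffer));
    size_t num_bytes_read = fileStream.gcount();   // read() returns the stream, not a byte count
    if (num_bytes_read <= window_size)
        break;                                     // nothing new beyond the overlap
    fileStream.clear();                            // a short read at end of file sets failbit
    fileStream.seekg(-static_cast<streamoff>(window_size), ios::cur);   // step back for the overlap
    read_num += num_bytes_read - window_size;
}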
This is not the only way to solve my problem, but it will make multi-tasking a breeze (I have been looking at other data structures like circular buffers, but those will make multitasking difficult). I was wondering if I can have your input on how much of a performance hit these seek operations might cause when processing very large files. I will only ever use one thread to read the data from the file.
The files contain large sequences of text, using only characters from the set {A,D,C,G,F,T}. Would it also be advisable to open the file as a binary file rather than in text mode as I am doing?
Because the file is large, I am also opening it in chunks, with the chunk size set to 32 MB. Would this be too large to take advantage of any caching mechanism?
On POSIX systems (notably Linux, and probably MacOSX), C++ streams are built on lower-level primitives (often system calls) such as read(2) and write(2). The implementation buffers the data (inside the standard C++ library, which calls read(2) on buffers of several kilobytes), and the kernel generally keeps recently accessed pages in its page cache. Hence, practically speaking, most files that are not too big (e.g. files of a few hundred megabytes on a laptop with several gigabytes of RAM) stay in RAM for a while once they have been read or written. See also sync(2).
As commented by Hans Passant, reading from the middle of a textual file could be error-prone if not done very carefully (in particular, because a UTF-8 character may span several bytes).
Notice that from a C (fopen) or C++ point of view, textual files and binary files differ notably in how they handle ends of lines.
If performance matters a lot to you, you could directly use low-level system calls like read(2), write(2) and lseek(2), but then be careful to use wide enough buffers (typically several kilobytes, e.g. 4 Kbytes to 512 Kbytes, or even several megabytes). Don't forget to use the returned count of bytes read or written (some I/O operations can be partial, or fail, etc.). If possible, avoid (for performance reasons) repeatedly calling read(2) for only a dozen bytes. You could instead memory-map the file (or a segment of it) using mmap(2) (before mmap-ing, use stat(2) to get metadata, notably the file size). And you could give advice to the kernel using posix_fadvise(2) or (for a file mapped into virtual memory) madvise(2). Performance details are heavily system dependent (file system, hardware, since SSDs and hard disks are different, and system load).
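As an illustration of the low-level route, here is a sketch of overlapped reads done with pread(2), which takes an absolute offset and so needs no separate lseek(2). It honours the returned byte count as advised above. The names read_at and scan_overlapped, and the idea of counting windows, are assumptions made for the example; a real program would process each window where the comment indicates, and might also want to retry on EINTR.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read up to len bytes at absolute offset off, continuing after short reads.
   Returns the bytes read (less than len only at end of file), or -1 on error. */
static ssize_t read_at(int fd, char *buf, size_t len, off_t off)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = pread(fd, buf + done, len - done, off + (off_t)done);
        if (n < 0)
            return -1;
        if (n == 0)
            break;                      /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Walk the file in windows of `window` bytes, where consecutive windows
   share `overlap` bytes. Returns the number of windows read, or -1 on error. */
long scan_overlapped(const char *path, size_t window, size_t overlap)
{
    long windows = 0;
    off_t off = 0;
    assert(path != NULL && overlap < window);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    char *buf = malloc(window);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    for (;;) {
        ssize_t n = read_at(fd, buf, window, off);
        if (n < 0) {
            windows = -1;
            break;
        }
        if (n == 0)
            break;                      /* nothing left to read */
        windows++;                      /* buf[0..n) would be processed here */
        if ((size_t)n < window)
            break;                      /* short window: end of file reached */
        off += (off_t)(window - overlap);
    }
    free(buf);
    close(fd);
    return windows;
}
```

For a 1000-byte file, scan_overlapped(path, 400, 100) visits offsets 0, 300, 600 and 900, i.e. four windows, the last one short.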
Finally, you should consider using some higher-level library on binary files, such as indexed files à la GDBM or the SQLite library, or consider using real databases such as PostgreSQL, MongoDB etc.
Apparently, your files contain genomics information. Probably you don't care about end-of-line processing and could open them as binary streams (or directly as low-level Unix file descriptors). Perhaps there already exist free software libraries to parse them. Otherwise, you might consider a two-pass approach: a first pass reads the entire file sequentially and remembers (in C++ containers like std::map) the interesting parts and their offsets. A second pass would use direct access. You might even have some preprocessor converting your genomics file into SQLite or GDBM files, and have your application work on these. You probably should avoid opening these files as text (and open them as binary files instead) because end-of-line processing is useless to you.
On a 64-bit system, if you handle only a few files (not thousands of them at once) of several dozen gigabytes each, memory mapping them (with mmap) should make sense; then use madvise (but on a 32-bit system, you won't be able to mmap the entire file).
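A rough sketch of the mmap-then-madvise idea, assuming a Linux-like system: the file size comes from fstat(2), the whole file is mapped read-only, and MADV_SEQUENTIAL tells the kernel a front-to-back scan is coming. The function name count_byte_mmap and the task (counting one base letter) are made up for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count occurrences of byte c in the file at path by mapping it read-only.
   Returns the count, or -1 on error. */
long long count_byte_mmap(const char *path, char c)
{
    struct stat st;
    long long count = 0;
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {              /* mmap rejects a zero length */
        close(fd);
        return 0;
    }
    size_t len = (size_t)st.st_size;
    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    /* Only a hint; if it fails the scan below still works. */
    (void)madvise(p, len, MADV_SEQUENTIAL);

    for (size_t i = 0; i < len; i++)
        if (p[i] == c)
            count++;

    munmap(p, len);
    return count;
}
```

Pages are brought in by the kernel as the loop touches them, so the process never holds more of the file than the page cache is willing to keep.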
Plausibly, yes. Whenever you seek, the data the stream has buffered for that file is (likely to be) discarded, causing the extra overhead of, at least, a system call to fetch the data again.
Assuming the file isn't enormous, it MAY be a better choice to read the entire file into memory (or, if you don't need portability, use a memory mapped file, at which point caching of the file content is trivial - again assuming the entire file fits in (virtual) memory).
However, all this is implementation dependent, so measuring performance of each method would be essential - it's only possible to KNOW these things for a given system by measuring, it's not something you can read about and get precise information on the internet (not even here on SO), because there are a whole bunch of factors that affect the behaviour.

Read with File Mapping Objects in C++

I am trying to use a Memory Mapped File (MMF) to read my .csv data file (very large, and time-consuming to read).
I've heard that MMF is very fast since it caches the content of the file, so users can access content on disk as if it were in memory.
May I know if MMF is any faster than using other reading methods?
If this is true, can anyone show me a simple example how to read a file from disk?
Many thanks in advance.
May I know if MMF is any faster than using other reading methods?
If you're reading the entire file sequentially in one pass, then a memory-mapped file is probably approximately the same as using conventional file I/O.
can anyone show me a simple example how to read a file from disk?
Memory mapped files are typically an operating system feature, so you'd have to tell us which platform you're on to get an example of using it.
If you want to read a file sequentially, you can use the C++ ifstream class or the C run-time functions like fopen, fread, and fclose.
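For the conventional route, a minimal sketch using the C run-time functions named above: it reads the file in 64 KiB blocks and counts CSV rows by counting newline bytes. The count_lines name and the block size are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Count newline-terminated lines by reading the file sequentially in blocks.
   Returns the line count, or -1 if the file cannot be opened or read. */
long long count_lines(const char *path)
{
    static char block[1 << 16];
    long long lines = 0;
    size_t got;
    assert(path != NULL);

    FILE *in = fopen(path, "rb");       /* binary mode: bytes arrive untranslated */
    if (in == NULL)
        return -1;
    /* fread returns the number of bytes delivered; 0 means EOF or error. */
    while ((got = fread(block, 1, sizeof block, in)) > 0) {
        for (size_t i = 0; i < got; i++)
            if (block[i] == '\n')
                lines++;
    }
    int failed = ferror(in);
    fclose(in);
    return failed ? -1 : lines;
}
```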
Whether it's faster or not depends on many different factors (such as what data you are accessing, how you are accessing it, etc.). To determine what is right for YOUR case, you need to benchmark different solutions and see which is best.
The main benefit of memory mapped files is that the data can be copied directly from the filesystem into the user-accessible memory.
In traditional types of file reading (fstream::read(), fread(), etc.), the content of the file is read into a temporary buffer in the OS, then (part of) that buffer is copied to the user-supplied buffer. This is because the OS can't rely on the memory being there, and it gets pretty messy pretty quickly. For memory mapped files, the OS knows directly where the memory is for the different sections of the file (because it's the OS's task to assign that memory and keep track of where it is!), so the OS can just copy it straight in.
However, I strongly suspect that the method of reading the file is a minor part, and the actual interpretation/parsing/copying out of the file may well be a large part. [Speculation, we haven't seen your code, of course]. And of course, the I/O speed available from the DISK itself may play a large factor if the file is very large.

How to asynchronously flush a memory mapped file?

I am using memory mapped files to have read-/write-access to a large number of image files (~10000 x 16 MB) under Windows 7 64bit. My goals are:
Having as much data cached as possible.
Being able to allocate new images and write to those as fast as possible.
Therefore I am using memory mapped files to access the files. Caching works well, but the OS is not flushing dirty pages until I am nearly out of physical memory. Because of that allocating and writing to new files is quite slow once the physical memory is filled.
One solution would be to call FlushViewOfFile() regularly, but this function does not return until the data has been written to disk.
Is there a way to flush a file mapping asynchronously? The only solution I found is to call UnmapViewOfFile() and then MapViewOfFile() again, but with this approach I cannot be sure of getting the same data pointer again. Can someone suggest a better approach?
Reading the WINAPI documentation a little longer, it seems that I found a suitable solution to my problem:
Calling VirtualUnlock() on a memory range that is not locked results in a flushing of dirty pages.
I have heard that the FlushViewOfFile() function does NOT wait until the data is physically written to the file.
The FlushViewOfFile function does not flush the file metadata, and it does not wait to return until the changes are flushed from the underlying hardware disk cache and physically written to disk.
After you call FlushFileBuffers( ... ), your data will be physically written to disk.

Speeding up file I/O: mmap() vs. read()

I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems, both of which make heavy use of posix_advise: the first is an mmap-based read in which we map the entire data set and read on demand.
The second is a read()/seek() based system.
Both work well, but only for the moderate cases. The read() method manages our overall file cache much better and can deal well with 100s of GB of files, but is badly rate limited. mmap is able to pre-cache data, making the sustained data rate of over 200 MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
is a similar problem to what I am working and provided a good starting point on this problem, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
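pid_t pid = fork();
if (pid > 0) {
    waitpid(pid, ...);                    /* parent: wait for cat to finish */
} else if (pid == 0) {
    dup2(dest, 1);                        /* child: the pipe becomes stdout */
    dup2(source, 0);                      /* child: the current file becomes stdin */
    execlp("cat", "cat", (char *)NULL);   /* the argument list must end with a null pointer */
    _exit(127);                           /* reached only if exec fails */
}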
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
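For the splice route, a minimal Linux-only sketch: it moves bytes from an open file into the write end of a pipe without passing them through a user-space buffer. The helper name splice_to_pipe is hypothetical. Note that splice(2) blocks once the pipe is full, so in real use something must be reading from the other end.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Move up to len bytes from the file in_fd into the pipe write end pipe_fd.
   Returns the number of bytes moved, or -1 on error. */
long long splice_to_pipe(int in_fd, int pipe_fd, size_t len)
{
    long long moved = 0;
    assert(in_fd >= 0 && pipe_fd >= 0);

    while (len > 0) {
        /* NULL offsets: use and advance the file's own position. */
        ssize_t n = splice(in_fd, NULL, pipe_fd, NULL, len, SPLICE_F_MOVE);
        if (n < 0)
            return -1;
        if (n == 0)
            break;                      /* end of the input file */
        moved += n;
        len -= (size_t)n;
    }
    return moved;
}
```

The processing step then reads from the other end of the pipe exactly as it would from stdin.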
To answer your more specific questions:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading etc. but it's dangerous and probably unproductive guess work. I'd just go with splicing the files into a pipe.
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB: Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte-sized files.
MAP_NORESERVE: Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you from running out of memory while keeping your implementation simple, if you don't actually have enough physical memory + swap for the entire mapping.
MAP_POPULATE: Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered and lazy. I suspect this flag is redundant; the VFS likely does this better by default.
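A sketch of how two of these options might be passed to mmap(2) on Linux, where they are the flags MAP_NORESERVE and MAP_POPULATE. The helper name map_prefaulted is hypothetical. The huge-pages flag is left out on purpose because, as far as I know, it is not usable on an ordinary file mapping; treat that as an assumption to verify on your kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only, asking the kernel to prefault its pages.
   On success stores the length in *len_out and returns the mapping, which
   the caller releases with munmap. Returns NULL on failure or empty file. */
const unsigned char *map_prefaulted(const char *path, size_t *len_out)
{
    struct stat st;
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                   MAP_PRIVATE | MAP_POPULATE | MAP_NORESERVE, fd, 0);
    close(fd);                          /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```

Whether the prefaulting helps or just competes with the kernel's own read-ahead is something only a benchmark on the real data set can answer.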
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Can't that be half a megabyte instead?
The problem here doesn't seem to be which api is used. It doesn't matter if you use mmap() or read(), the disc still has to seek to the specified point and read the data (although the os does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes), because you don't have to call the OS for every chunk, which becomes very slow.
I would also advise like Basile did to read more than 2kb consecutively so the disc doesn't have to seek that often.

synchronized write operation in C

I am working on a smart camera that runs Linux. I capture images from the camera streaming software and write them to an SD card (attached to the camera). To write the individual JPEG images, I use the C functions fopen and fwrite. To synchronize the disk writes, I call fflush(pointer) to flush the buffers and write the data to the SD card. But it seems to have no effect: the write operation uses system memory, and free memory decreases after every write. I also tried the low-level open and write functions together with fsync(filedesc), but that has no effect either.
The flushing of buffers take place only when I dismount the SD card and then the memory is freed. How can I disable this cache write instead of SD card write? or how can I force the data to be written on the SD card at the same time instead of using the system memory?
sync(2) is probably your best bet:
SYNC(2) Linux Programmer's Manual SYNC(2)
sync - commit buffer cache to disk
#include <unistd.h>
void sync(void);
sync() first commits inodes to buffers, and then buffers to disk.
According to the standard specification (e.g., POSIX.1-2001), sync()
schedules the writes, but may return before the actual writing is done.
However, since version 1.3.20 Linux does actually wait. (This still
does not guarantee data integrity: modern disks have large caches.)
You can set the O_SYNC if you open the file using open(), or use sync() as suggested above.
With fopen(), you can use fsync(fileno(fp)) after fflush(fp), or use a combination of fileno() and ioctl() to set options on the descriptor.
For more details see this very similar post: How can you flush a write using a file descriptor?
Check out fsync(2) when working with specific files.
There may be nothing that you can really do. Many file systems are heavily cached in memory so a write to a file may not immediately be written to disk. The only way to guarantee a write in this scenario is to actually unmount the drive.
When mounting the disk, you might want to specify the sync option (either using the -o flag of mount or on your fstab line). This will ensure that at least your writes are written synchronously. This is what you should always use for removable media.
Just because it's still taking up memory doesn't mean it hasn't also been written out to storage - a clean (identical to the copy on physical storage) copy of the data will stay in the page cache until that memory is needed for something else, in case an application later reads that data back.
Note that fflush() doesn't ensure the data has been written to storage - if you are using stdio, you must first use fflush(f), then fsync(fileno(f)).
If you know that you will not need to read that data again in the foreseeable future (as seems likely for this case), you can use posix_fadvise() with the POSIX_FADV_DONTNEED flag before closing the file.
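Put together, the sequence could be sketched like this for one image file. The function name write_durably is made up, and error handling is kept to a minimum. The posix_fadvise call is only a hint, so its result is ignored.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write one buffer to path, push it to storage, then tell the kernel the
   cached pages are no longer needed. Returns 0 on success, -1 on failure. */
int write_durably(const char *path, const void *data, size_t len)
{
    assert(path != NULL && (data != NULL || len == 0));

    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;

    int ok = fwrite(data, 1, len, f) == len
          && fflush(f) == 0             /* stdio buffer -> kernel page cache */
          && fsync(fileno(f)) == 0;     /* page cache -> storage device */

    if (ok) {
        /* The pages are clean after fsync, so the kernel may drop them.
           Offset 0 with length 0 means "the whole file". */
        (void)posix_fadvise(fileno(f), 0, 0, POSIX_FADV_DONTNEED);
    }
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

The order matters: the advice is given after fsync because, as I understand it, pages that are still dirty cannot simply be discarded.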