Writing data to a file in C++ - most efficient way? - c++

In my current project I'm dealing with a big amount of data which is being generated on-the-run by means of a "while" loop. I want to write the data onto a CSV file, and I don't know what's better - should I store all the values in a vector array and write to the file at the end, or write in every iteration?
I guess the first choice it's better, but I'd like an elaborated answer if that's possible. Thank you.

Make sure that you're using an I/O library with buffering enabled, and then write every iteration.
This way your computer can start doing disk access in parallel with the remaining computations.
PS. Don't do anything crazy like flushing after each write, or opening and closing the file each iteration. That would kill efficiency.

The most efficient method to write to a file is to reduce the number of write operations and increase the data written per operation.
Given a byte buffer of 512 bytes, the most inefficient method is to write 512 bytes, one write operation at a time. A more efficient method is to make one operation to write 512 bytes.
There is overhead associated with each call to write to a file. That overhead consists of locating the file on the drive in it's catalog, seeking to the a new location on the drive and writing. The actual operation of writing is quite fast; it's this seeking and waiting for the hard drive to spin up and get ready that wastes your time. So spin it up once, keep it spinning by writing a lot of stuff, then let it spin down. The more data written while the platters are spinning the more efficient the write will be.
Yes, there are caches everywhere along the data path, but all that will be more efficient with large data sizes.
I would recommend writing the the formatted to a text buffer (that is a multiple of 512), and at certain points, flush the buffer to the hard drive. (512 bytes is a common sector size multiple on hard drives).
If you like threads, you can create a thread that monitors the output buffer. When the output buffer reaches a threshold, the thread writes the contents to drive. Multiple buffers can help by having the fast processor fill up buffers while other buffers are written to the slow drive.
If your platform has DMA you might be able to speed things up by having the DMA write the data for you. Although I would expect a good driver to do this automatically.
I do use this technique on an embedded system, using a UART (RS232 port) instead of a hard drive. By using the buffering, I'm able go get about 80% efficiency.
(Loop unrolling may also help.)

The easiest way is in console with > operator. In linux:
./miProgram > myData.txt
Thats get the input of the program and puts in a file.
Sorry for the english :)

Related

Multithreaded Files Reading

I need to read / parse a large binary file (4 ~ 6 GB) that comes in fixed chunks of 8192 bytes. My current solution involves streaming the file chunks using the Single Producer Multiple Consumer (SPMC) pattern.
EDIT
File size = N * 8192 Bytes
All I am required to do is to do something to each of these 8192 bytes. The file is only required to be read once top down.
Having thought that this should be an embarrassingly parallel problem, I would like to have X threads to read at equal ranges of (File Size / X) sizes independently. The threads do not need to communicate with each other at all.
I've tried spawning X threads to open the same file and seek to their respective sections to process, however, this solution seems to have a problem with the due to HDD mechanical seeks and apparently performs worse than the SPMC solution.
Would there be any difference if this method is used on the SSD instead?
Or would it be more straight forward to just memory map the whole file and use #pragma omp parallel for to process the chunks? I suppose I would need sufficient enough RAM to do this?
What would you suggest?
What would you suggest?
Don't use mmap()
Per Linux Torvalds himself:
People love mmap() and other ways to play with the page tables to
optimize away a copy operation, and sometimes it is worth it.
HOWEVER, playing games with the virtual memory mapping is very
expensive in itself. It has a number of quite real disadvantages that
people tend to ignore because memory copying is seen as something very
slow, and sometimes optimizing that copy away is seen as an obvious
improvment.
Downsides to mmap:
quite noticeable setup and teardown costs. And I mean noticeable.
It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the
mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Upsides of mmap:
if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread.
This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it's just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.
if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again.
And the automatic sharing is obviously a case of this.
But your test-suite (just copying the data once) is probably pessimal
for mmap().
Note the last - just using the data once is a bad use-case for mmap().
For a file on an SSD, since there are no physical head seek movements:
Open the file once, using open() to get a single int file descriptor.
Use pread() per thread to read appropriate 8kB chunks. pread() reads from a specified offset without using lseek(), and does not effect the current offset of the file being read from.
You'll probably need somewhat more threads than CPU cores, since there's going to be significant IO waiting on each thread.
For a file on mechanical disk(s):
You want to minimize head seek(s) on the mechanical disk.
Open the file once, using open() with direct IO (assuming Linux, open( filename, O_RDONLY | O_DIRECT );) to bypass the page cache (since you're going to stream the file and never re-read any portion of it, the page cache does you no good here)
Using a single producer thread, read large chunks (say 64k to 1MB+)
into one of N page-aligned buffers.
When a buffer is read, pass it to the worker threads, then read to fill the next buffer
When all workers are done using their part of the buffer, pass the
buffer back to the reading thread.
You'll need to experiment with the proper read() size, the number of worker threads, and the number of buffers passed around. Larger read()s will be more efficient, but the larger buffer size makes the memory requirements larger and makes the latency of getting that buffer back from the worker threads much more unpredictable. You want to make as few copies of the data as possible, so you'd want the worker threads to work directly on the buffer read from the file.
Even if the processing of each 8K block is significant (short of OCR processing), the i/o is the bottleneck. Unless it can be arranged for parts of the file to be already cached by previous operations....
If the system this is to run on can be dedicated to the problem:
Obtain the file size (fstat)
Allocate a buffer that size.
Open and read the whole file into the buffer.
Figure out how to partition the data per thread and spin off the threads.
Time that algorithm.
Then, revise it using asynchronous reading. See man aio_read and man 7 aio to learn what needs to be done.

Is it better to read an entire file in std::string or to manipulate a file with std::ifstream?

I am actually developing scientific C++ simulation programs which read data, compute lots of values from them and finally store the results in a file. I wanted to know if reading all the data at once at the beginning of the program is faster than keep accessing the file via std::ifstream during the program.
The data I am using are not very big (several MB), but I do not even know what "big" is for a heap allocation...
I guess it depends on the data and so on (and after some test, effectively it depends), but I was wondering on what it was depending and whether there is a kind of general principle we should be following.
Long story short, the question is: does keeping a file opened and using file manipulators is faster than a potentially big heap allocation and using string manipulators?
Take a look at mmap. This API allows you to map a file descriptor into your address space using the same paging mechanism as is used for RAM. This way you should get both the benefit random access to data while not unnecessarily copying unneeded data into RAM.
Is reading all the data at once at the beginning of the program faster than keep accessing the file via std::ifstream during the program? Yes, probably it is. Keep in mind that working memory is fast and expensive, while storage memory (a hard drive) exists precisely to be cheap at the cost of being slow.
What is "big" for a heap allocation? The operating system is going to try to fool your process into thinking that all existing working memory is free. This is not actually true, and the OS will "swap" one type of memory for the other if some process requests too much memory. But in principle, you should think that a heap allocation is big if it is comparable to the total size of working memory.
Does keeping a file opened and using file manipulators is faster than a potentially big heap allocation and using string manipulators? No, it is not faster, but it has another advantage: it is memory-efficient. If you only put the needed data into memory in order to work with them, you are saving memory for all other processes in the machine (which could be other threads of your program, for instance). This is a very interesting property in order to have scalable software.
(Expect this to be closed because it is an "opinion based" question.)
My thoughts:
This sounds like premature optimization. Write it the easy way, then optimize if it is too slow.
Working in memory is generally thousands of times faster. Heap allocations slow down based on the number of allocations, not the size of the allocations. It does not sound like you are working with a whole lot of data though.
If your files are "several MB" then the OS will probably cache it anyway.
Reading data from a file in large chunks is a lot faster than many read requests of small size. For example, 1 read of 10MB is faster than 10 reads of 1MB.
When I optimize file I/O, I read the data into a uint8_t buffer and then parse the buffer. One thorn with this method is reading of text files. There is a possibility that the text encoded data may span across a buffer boundary. For example, you have 4 numbers per text line and only 2 are in the buffer (or only 2 digits of the number are in the buffer). You will have to write code to handle these cases.
If you consider your program as a pipeline, you may be able to optimize futher. You can implement threads: a reading thread, a processing thread, and a writing (output) thread. The reading thread reads into buffers. When there is enough data for processing, the reading thread wakes up the processing thread. The processing thread processes the data that was read, and when there is some output, it stores it into a shared buffer and wakes up the output thread. So with the pipeline model, the data enters the pipe via the reading thread. At some point in the pipe, the processing thread processes the data. The writing thread takes the data from the processing thread and outputs it (exiting the pipeline).
Also, organizing your data so it fits into a processor cache line will also speed up your program.

Reading inputs in smaller, more frequent reads or one larger read

I was working on a C++ tutorial exercise that asked to count the number of words in a file. It got me thinking about the most efficient way to read the inputs. How much more efficient is it really to read the entire file at once than it is to read small chunks (line by line or character by character)?
The answer changes depending on how you're doing the I/O.
If you're using the POSIX open/read/close family, reading one byte at a time will be excruciating since each byte will cost one system call.
If you're using the C fopen/fread/fclose family or the C++ iostream library, reading one byte at a time still isn't great, but it's much better. These libraries keep an internal buffer and only call read when it runs dry. However, since you're doing something very trivial for each byte, the per-call overhead will still likely dwarf the per-byte processing you actually have to do. But measure it and see for yourself.
Another option is to simply mmap the entire file and just do your logic on that. You might, or might not, notice a performance difference between mmap with and without the MAP_POPULATE flag. Again, you'll have to measure it and see.
The most efficient method for I/O is to keep the data flowing.
That said, reading one block of 512 characters is faster than 512 reads of 1 character. Your system may have made optimizations, such as caches, to make reading faster, but you still have the overhead of all those function calls.
There are different methods to keep the I/O flowing:
Memory mapped file I/O
Double buffering
Platform Specific API
Some simple experiments should suffice for demonstration.
Create a vector or array of 1 megabyte.
Start a timer.
Repeat 1000 times:
Read data into container using 1 read instruction.
End the timer.
Repeat, using a for loop, reading 1,000,000 characters, with 1 read instruction each.
Compare your data.
Details
For each request from the hard drive, the following steps are performed (depending on platform optimizations):
Start hard drive spinning.
Read filesystem directory.
Search directory for the filename.
Get logical position of the byte requested.
Seek to the given track & sector.
Read 1 or more sectors of data into hard drive memory.
Return the requested portion of hard drive memory to the platform.
Spin down the hard drive.
This is called overhead (except where it reads the sectors).
The object is to get as much data transferred while the hard drive is spinning. Starting a hard drive takes more time than to keep it spinning.

Speeding up file I/O: mmap() vs. read()

I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems both make heavy use of posix_advise: first is a mmaped read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well but only for the moderate cases, the read() method manages our overall file cache much better and can deal well with 100s of GB of files, but is badly rate limited, mmap is able to pre-cache data making the sustained data rate of over 200MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks
is a similar problem to what I am working and provided a good starting point on this problem, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
if (pid = fork()) {
waitpid(pid, ...);
} else {
dup2(dest, 1);
dup2(source, 0);
execlp("cat", "cat");
}
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading etc. but it's dangerous and probably unproductive guess work. I'd just go with splicing the files into a pipe.
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you running out of memory while keeping your implementation simple if you don't actually have enough physical memory + swap for the entire mapping.**
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered, and lazy. I suspect this flag is redundant, the VFS likely does this better by default.
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunk much bigger than a few kilobytes. Can't than be half a megabyte instead?
The problem here doesn't seem to be which api is used. It doesn't matter if you use mmap() or read(), the disc still has to seek to the specified point and read the data (although the os does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes) because you don't have call the os for every chunk, which becomes very slow.
I would also advise like Basile did to read more than 2kb consecutively so the disc doesn't have to seek that often.

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started writing the results straight away in files, one at a time, which was the slowest option. I figured it gets a lot faster if I build a vector of a certain amount of the files and then write them all at once, then go back to processing while the hard disk is occupied in writing all that stuff that i poured into it (that at least seems to be what happens).
My question is, can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints ? To me it seems to be a hard disk buffer thing, I have 16MB buffer on that hard disk and get these values (all for ~100000 files):
Buffer size time (minutes)
------------------------------
no Buffer ~ 8:30
1 MB ~ 6:15
10 MB ~ 5:45
50 MB ~ 7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how writing performance is to be optimized in general, for example are larger hard disk blocks helpful, etc.
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center) WD 3,5 1TB/7200/16MB/USB2, HFS+ journalled, OS is MacOS 10.5. I'll soon give it a try on Ext3/Linux and internal disk rather than external).
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. OS's (at least windows) is already really good at helping write access via buffering "under the hood", but if your reading in serial there isn't too much it can do to help. If use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased perf.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. 100.000s of files to write is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100.000. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double-buffer per helper. (One buffer owned by the write process and one by the read process.) The read thread fills the buffer until a specified limit then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit. If writing went faster than last time, increase the buffer by X%. If it went slower, adjust by –X%. X can be small.