The fastest way to write data while producing it - c++

In my program I am simulating a N-body system for a large number of iterations. For each iteration I produce a set of 6N coordinates which I need to append to a file and then use for executing the next iteration. The code is written in C++ and currently makes use of ofstream's method write() to write the data in binary format at each iteration.
I am not an expert in this field, but I would like to improve this part of the program, since I am in the process of optimizing the whole code. I feel that the latency associated with writing the result of the computation at each cycle significantly slows down the performance of the software.
I'm confused because I have no experience in actual parallel programming and low level file I/O. I thought of some abstract techniques that I imagined I could implement, since I am programming for modern (possibly multi-core) machines with Unix OSes:
Writing the data in the file in chunks of n iterations (there seem to be better ways to proceed...)
Parallelizing the code with OpenMP (how to actually implement a buffer so that the threads are synchronized appropriately, and do not overlap?)
Using mmap (the file size could be huge, on the order of GBs, is this approach robust enough?)
However, I don't know how to best implement them and combine them appropriately.

Of course, writing to a file at each iteration is inefficient and will most likely slow down your computation. (As a rule of thumb; it depends on your actual case.)
You have to use a producer -> consumer design pattern. They will be linked by a queue, like a conveyor belt.
The producer will try to produce as fast as it can, only slowing if the consumer can't handle it.
The consumer will try to "consume" as fast as it can.
By splitting the two, you can increase performance more easily because each process is simpler and has less interference from the other.
If the producer is faster, you need to improve the consumer, in your case by writing to the file in the most efficient way, most likely chunk by chunk (as you said).
If the consumer is faster, you need to improve the producer, most likely by parallelizing it as you said.
There is no need to optimize both. Only optimize the slowest (the bottleneck).
Practically, you use threads and a synchronized queue between them. For implementation hints, have a look here, especially §18.12 "The Producer-Consumer Pattern".
About flow management, you'll have to add a little more complexity by selecting a "max queue size" and making the producer(s) wait if the queue does not have enough space. Beware of deadlocks then, and code it carefully. (See the Wikipedia link I gave about that.)
Note: it's a good idea to use Boost threads, because raw threads are not very portable. (Well, they are since C++0x, but C++0x availability is not yet good.)
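For concreteness, a minimal bounded-queue sketch using C++11 std::thread and std::condition_variable might look like the following (Boost's thread primitives are nearly identical). The BoundedQueue and Snapshot names, the capacity of 64, and the output file name are illustrative only:

```cpp
// Minimal sketch of a bounded producer/consumer queue (C++11).
// BoundedQueue, Snapshot, the capacity and the file name are illustrative.
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Snapshot = std::vector<double>;   // the 6N coordinates of one iteration

class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(Snapshot s) {              // blocks while the queue is full
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [this] { return q_.size() < capacity_; });
        q_.push(std::move(s));
        not_empty_.notify_one();
    }

    bool pop(Snapshot& out) {            // returns false once closed and drained
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return true;
    }

    void close() {                       // producer signals "no more data"
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        not_empty_.notify_all();
    }

private:
    std::queue<Snapshot> q_;
    std::size_t capacity_;
    bool closed_ = false;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
};

int main() {
    BoundedQueue queue(64);              // the "max queue size" knob

    std::thread consumer([&] {           // consumer: writes snapshots to disk
        std::ofstream out("trajectory.bin", std::ios::binary);
        Snapshot s;
        while (queue.pop(s))
            out.write(reinterpret_cast<const char*>(s.data()),
                      s.size() * sizeof(double));
    });

    const std::size_t N = 1000, iterations = 10000;
    for (std::size_t i = 0; i < iterations; ++i) {   // producer: the simulation
        Snapshot s(6 * N);
        // ... compute the 6N coordinates of this iteration into s ...
        queue.push(std::move(s));
    }
    queue.close();
    consumer.join();
}
```

The capacity is the flow-control knob discussed above: a full queue blocks the producer, an empty one blocks the consumer, and close() lets the consumer drain what is left and exit without deadlocking.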

It's better to split operation into two independent processes: data-producing and file-writing. Data-producing would use some buffer for iteration-wise data passing, and file-writing would use a queue to store write requests. Then, data-producing would just post a write request and go on, while file-writing would cope with the writing in the background.
Essentially, if the data is produced much faster than it can possibly be stored, you'll quickly end up holding most of it in the buffer. In that case your current approach seems quite reasonable as is, since little can be done programmatically to improve the situation.

If you don't want to play with doing stuff in a different thread, you could try using aio_write(), which allows asynchronous writes. Essentially you give the OS the buffer to write, the function returns immediately, and the write finishes while your program continues; you can check later to see if the write has completed.
This solution still suffers from the producer/consumer problem mentioned in other answers: if your algorithm is producing data faster than it can be written, eventually you will run out of memory to store the results between the algorithm and the write, so you'd have to try it and see how it works out.
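A rough sketch of what an aio_write() call looks like on POSIX systems (on Linux, link with -lrt); error handling is stripped down, the file name and sizes are placeholders, and the key constraint is that the buffer must stay untouched until the write has completed:

```cpp
// Rough sketch of POSIX aio_write. Error handling is minimal, and the buffer
// must not be reused until the write has completed.
#include <aio.h>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main() {
    int fd = open("trajectory.bin", O_WRONLY | O_CREAT | O_APPEND, 0644);

    std::vector<double> coords(6 * 1000);   // one iteration's worth of data
    // ... fill coords ...

    aiocb cb{};                             // zero-initialized control block
    cb.aio_fildes = fd;
    cb.aio_buf    = coords.data();
    cb.aio_nbytes = coords.size() * sizeof(double);

    aio_write(&cb);                         // returns immediately

    // ... compute the next iteration here while the OS writes ...

    while (aio_error(&cb) == EINPROGRESS)   // poll later (or use aio_suspend)
        usleep(1000);
    ssize_t written = aio_return(&cb);      // collect the result ("reap" it)
    (void)written;

    close(fd);
}
```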

"Using mmap (the file size could be huge, on the order of GBs, is this
approach robust enough?)"
mmap is the OS's method of loading programs, shared libraries and the page/swap file - it's as robust as any other file I/O and generally higher performance.
BUT on most OSes it's bad/difficult/impossible to expand the size of a mapped file while it's being used. So if you know the size of the data, or you are only reading, it's great. For a log/dump that you are continually adding to, it's less suitable - unless you know some maximum size.
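If you do know a maximum size up front, one workable pattern (sketched below for POSIX; the 4 GiB bound and the file name are assumptions) is to ftruncate the file to that maximum, map it once, append into the mapping, and trim the file when you're done:

```cpp
// Sketch: mmap-based appending when a maximum file size is known in advance
// (POSIX). The 4 GiB bound and the file name are assumptions.
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const size_t max_bytes = 4ull << 30;            // assumed upper bound
    int fd = open("trajectory.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, max_bytes);                       // pre-size the file

    char* base = static_cast<char*>(
        mmap(nullptr, max_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

    size_t used = 0;
    for (int i = 0; i < 10000; ++i) {               // the simulation loop
        double coords[6 * 100];                     // one iteration's data
        // ... compute coords ...
        std::memcpy(base + used, coords, sizeof coords);
        used += sizeof coords;
    }

    munmap(base, max_bytes);
    ftruncate(fd, used);                            // trim the unused tail
    close(fd);
}
```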

Related

erlang processes and message passing architecture

The task I have in hand is to read the lines of large file, process them, and return ordered results.
My algorithm is:
start with a master process that will evaluate the workload (written in the first line of the file)
spawn worker processes: each worker will read part of the file using pread/3, process this part, and send results to the master
the master receives all sub-results, sorts them, and returns
so basically no communication is needed between workers.
My questions:
How do I find the optimal balance between the number of Erlang processes and the number of cores? If I spawn one process for each processor core I have, would that be under-utilizing my CPU?
How does pread/3 reach the specified line; does it iterate over all lines in the file? And is pread/3 a good plan for parallel file reading?
Is it better to send one big message from process A to B, or to send N small messages? I have found part of the answer in the link below, but I would appreciate further elaboration
erlang message passing architecture
Erlang processes are cheap. You're free (and encouraged) to use more than however many cores you have. There might be an upper limit to what is practical for your problem (loading 1TB of data with one process per line is asking a bit much, depending on line size).
The easiest way to do it when you don't know is to let the user decide. This means you could decide to spawn N workers, and distribute work between them, waiting to hear back. Re-run the program while changing N if you don't like how it runs.
A trickier way to do it is to benchmark a bunch of times, pick what you think makes sense as a maximal value, stick it in a pool library (if you want to; some pools go for preallocated resources, some for a resizable amount), and settle for what would be a one-size-fits-all solution.
But really, there is no easy 'optimal number of cores'. You can run it on 50 processes as well as on 65,000 of them if you want; if the task is embarrassingly parallel, the VM should be able to make use of most of them and saturate the cores anyway.
-
Parallel file reads is an interesting question. It may or may not be faster (as direct comments have mentioned) and it may only represent a speed up if the work on each line is minimal enough that reading the file has the biggest cost.
The tricky bit is really that functions like pread/2-3 take a byte offset. Your question is worded such that you are worried about lines of the file. The byte offsets you hand off to workers may therefore end up straddling a line. If your block ends at the word my in this is my line\nhere it goes\n, one worker will see it has an incomplete line, while the other will report only on my line\n, missing the prior this is.
Generally, this kind of annoying stuff is what will lead you to have the first process own the file and sift through it, only handing off bits of text to the workers to process; that process will then act as some sort of coordinator.
The nice aspect of this strategy is that if the main process knows everything that was sent as a message, it also knows when all responses have been received, making it easy to know when to return the results. If everything is disjoint, you have to trust both the starter and the workers to tell you "we're all out of work" as a distinct set of independent messages.
In practice, you'll probably find that what helps the most is doing file operations in a way that is kind to your hardware, more than worrying about "how many processes can read the file at once". There's only one hard disk (or SSD), all data has to go through it anyway; parallelism may be limited in the end by access to it.
-
Use messages that make sense for your program. The most performant program would have a lot of processes able to do work without ever needing to pass messages, communicate, or acquire locks.
A more realistic very performant program would use very few messages of a very small size.
The fun thing here is that your problem is inherently data-based. So there are a few things you can do:
Make sure you read text in a binary format; large binaries (> 64 bytes) get allocated on a global binary heap, are shared around, and are GC'd with reference counting
Hand in information on what needs to be done rather than the data for doing it; this one would need measuring, but the lead process could go over the file, note where lines end, and just hand byte offsets to the workers so they can go and read the file themselves; do note that you'll end up reading the file twice, so if memory allocation is not your main overhead, this will likely be slower
Make sure the file is read in raw or ram mode; other modes use a middle-man process to read and forward data (this is useful if you read files over a network in clustered Erlang nodes); raw and ram modes give the file descriptor directly to the calling process and are a lot faster.
First worry about writing a clear, readable and correct program. Only if it is too slow should you attempt to refactor and optimize it; you may very well find it good enough on the first try.
I hope this helps.
P.S. You can try the really simple stuff at first:
either:
read the whole file at once with {ok, Bin} = file:read_file(Path) and split lines (with binary:split(Bin, <<"\n">>, [global])),
use {ok, Io} = file:open(File, [read,ram]) and then use file:read_line(Io) on the file descriptor repeatedly
use {ok, Io} = file:open(File, [read,raw,{read_ahead,BlockSize}]) and then use file:read_line(Io) on the file descriptor repeatedly
call rpc:pmap({?MODULE, Function}, ExtraArgs, Lines) to run everything in parallel automatically (it will spawn one process per line)
call lists:sort/1 on the result.
Then from there you can refine each step if you identify them as problematic.

C/C++ Linux: fastest write of a fixed chunk of memory to file (1 Hz)

On a Linux system, I have one 7MB chunk of memory of fixed size (no growth) whose contents I refresh in a real-time application.
I need to write this chunk of memory to disk (same file) once per second.
Taking into consideration modern (late 2011) CPUs and HDDs, what is the most efficient way to implement this functionality? I don't care if the write actually takes some time, but as this is a real-time app, I need to return to the running app ASAP.
What methodologies should I be trying?
My baseline is a standard fopen(), binary fwrite(), fclose() cycle.
I have read that mmap() might be useful. Maybe asynchronous I/O? Are there other methodologies that I should be benchmarking? Off the top of your head, which methodology do you think would be fastest?
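For reference, the baseline cycle described above is roughly the following (the file name and buffer type are placeholders):

```cpp
// Rough sketch of the baseline: fopen/fwrite/fclose of the 7 MB block once
// per cycle. The file name and the buffer type are placeholders.
#include <cstdio>
#include <vector>

void dump_baseline(const std::vector<char>& block) {   // block is the 7 MB chunk
    std::FILE* f = std::fopen("snapshot.bin", "wb");
    std::fwrite(block.data(), 1, block.size(), f);
    std::fclose(f);   // returns once the buffered data is handed to the OS
}
```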
mmap(2) is the way to go. Just call msync(2) with MS_ASYNC when you want to write it.
I'd combine the two approaches mentioned: I'd use mmap to map the memory to the file, then set up a separate thread (with lower priority) to msync it every second. (In this case, the actual arguments to msync are not too important; you don't need MS_ASYNC, since you won't be blocking the main thread.)
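A sketch of that combination, using C++11 threads, with priority tuning and error handling left out (the file name is a placeholder; the region size follows the question):

```cpp
// Sketch: the 7 MB region is backed by the file via mmap, and a background
// thread flushes it once per second with msync. Priority tuning and error
// handling are omitted.
#include <atomic>
#include <chrono>
#include <fcntl.h>
#include <sys/mman.h>
#include <thread>
#include <unistd.h>

int main() {
    const size_t size = 7 * 1024 * 1024;
    int fd = open("state.bin", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, size);
    void* region = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    std::atomic<bool> running{true};
    std::thread flusher([&] {
        while (running) {
            std::this_thread::sleep_for(std::chrono::seconds(1));
            msync(region, size, MS_SYNC);   // blocks this thread, not the app
        }
    });

    // ... the real-time application updates the 7 MB region in place here ...

    running = false;
    flusher.join();
    msync(region, size, MS_SYNC);           // final flush
    munmap(region, size);
    close(fd);
}
```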
Another alternative possibly worth trying would be asynchronous IO. It's not clear to me from my documentation what happens if you never recover the results, however, so you may need some sort of reaper code to prevent lost resources. (Asynchronous IO seems rather underspecified in Posix, which IMHO is a good reason to avoid it.)

C++ Boost Multithread is slower than single thread because of CPU type?

I have posted about Boost multithreading before. This time I'm just curious and disappointed, because I thought multithreading was supposed to be faster than a single thread.
Two threads do file I/O to read/parse the CSV data. When I used multiple threads, it took about 40 seconds on average on a Pentium D machine (a Dell OptiPlex 745 desktop).
With a single thread, it took about 8-10 seconds on average on the same PC.
I tried using completely different parameter names in the two threads; the result is no different. If anyone has used C++ Boost multithreading for reading and parsing a big data file before, I would love to hear your opinions. Thanks.
Andrew
Two threads do file I/O to read/parse the CSV data.
If they're reading the same file with the same file handle, then they might be spending most of their time blocked waiting for the other thread to get done. If they're using different file handles to read the same file, they're forcing the disk to keep seeking back and forth, which isn't as efficient an operation as a straight sequential read.
Threading doesn't speed up big file reading and parsing. What it does is let you do something else entirely while the file is being read and parsed.
You've created an I/O bottleneck, which threading does not help with. Threading is intended for reducing CPU bottlenecks when the algorithm can be broken into independent threads of execution; algorithms that have a lot of dependency on previous output (file parsing is one case) generally don't thread well.
If you can split up the parsing problem and have each thread parse a different part of the file, you might get a little improvement, but probably not, since the seeking will be wasting your time. If you can have one thread doing bulk reading and some preprocessing, then handing off chunks to a thread pool for the real heavy processing (is there any?), then you might see some noticeable improvement over single threading.
This is all general and a bit stream-of-consciousness, but it's hard to do much more with what you're giving. I hope it helps.
Without seeing your code it's hard to say exactly what's going on, but in general, multiple threads don't necessarily get you better performance, and in fact can very often lead to obvious performance degradation.
In your situation, if you are having both threads read and parse, then they could be contending for I/O, and possibly the locks surrounding any shared read/write memory areas, both of which would introduce a slow-down where the single-threaded version would have no issue.
To do this properly, you would probably want a single thread reading from the file, and another thread parsing the data as it comes in on a producer/consumer queue. This would minimize the lock contention (since it can be implemented with waiters only), and would ensure that you were actually taking advantage of the parallelization available in your problem.
That being said, a single-threaded version might still be faster; it's often the case.

What is the Fastest Method for High Performance Sequential File I/O in C++?

Assuming the following for...
Output:
The file is opened...
Data is 'streamed' to disk. The data in memory is in a large contiguous buffer. It is written to disk in its raw form directly from that buffer. The size of the buffer is configurable, but fixed for the duration of the stream. Buffers are written to the file, one after another. No seek operations are conducted.
...the file is closed.
Input:
A large file (sequentially written as above) is read from disk from beginning to end.
Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?
Some possible considerations:
Guidelines for choosing the optimal buffer size
Will a portable library like boost::asio be too abstracted to expose the intricacies of a specific platform, or can they be assumed to be optimal?
Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?
I realize that this will have platform-specific considerations. I welcome general guidelines as well as those for particular platforms.
(My most immediate interest is in Win x64, but I am interested in comments on Solaris and Linux as well.)
Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?
Rule 0: Measure. Use all available profiling tools and get to know them. It's almost a commandment in programming that if you didn't measure it you don't know how fast it is, and for I/O this is even more true. Make sure to test under actual work conditions if you possibly can. A process that has no competition for the I/O system can be over-optimized, fine-tuned for conditions that don't exist under real loads.
Use mapped memory instead of writing to files. This isn't always faster but it allows the opportunity to optimize the I/O in an operating system-specific but relatively portable way, by avoiding unnecessary copying and taking advantage of the OS's knowledge of how the disk is actually being used. ("Portable" if you use a wrapper, not an OS-specific API call.)
Try and linearize your output as much as possible. Having to jump around memory to find the buffers to write can have noticeable effects under optimized conditions, because cache lines, paging and other memory subsystem issues will start to matter. If you have lots of buffers look into support for scatter-gather I/O which tries to do that linearizing for you.
Some possible considerations:
Guidelines for choosing the optimal buffer size
Page size for starters, but be ready to tune from there.
Will a portable library like boost::asio be too abstracted to expose the intricacies of a specific platform, or can they be assumed to be optimal?
Don't assume it's optimal. It depends on how thoroughly the library gets exercised on your platform, and how much effort the developers put into making it fast. Having said that a portable I/O library can be very fast, because fast abstractions exist on most systems, and it's usually possible to come up with a general API that covers a lot of the bases. Boost.Asio is, to the best of my limited knowledge, fairly fine tuned for the particular platform it is on: there's a whole family of OS and OS-variant specific APIs for fast async I/O (e.g. epoll, /dev/epoll, kqueue, Windows overlapped I/O), and Asio wraps them all.
Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?
Asynchronous I/O isn't faster in a raw sense than synchronous I/O. What asynchronous I/O does is ensure that your code is not wasting time waiting for the I/O to complete. It is faster in a general way than the other method of not wasting that time, namely using threads, because it will call back into your code when I/O is ready and not before. There are no false starts or concerns with idle threads needing to be terminated.
A general piece of advice is to turn off buffering and read/write in large chunks (but not too large, or you will waste too much time waiting for the whole I/O to complete when you could otherwise already start munching away at the first megabyte. It's trivial to find the sweet spot with this approach, since there's only one knob to turn: the chunk size).
Beyond that, for input mmap()ing the file shared and read-only is (if not the fastest, then) the most efficient way. Call madvise() if your platform has it, to tell the kernel how you will traverse the file, so it can do readahead and throw out the pages afterwards again quickly.
For output, if you already have a buffer, consider underpinning it with a file (also with mmap()), so you don't have to copy the data in userspace.
If mmap() is not to your liking, then there's fadvise(), and, for the really tough ones, async file I/O.
(All of the above is POSIX, Windows names may be different).
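As a small illustration of the fadvise route for sequential input (POSIX; the file name and the 1 MiB chunk size are placeholders to tune):

```cpp
// Sketch of the fadvise route for sequential input (POSIX): hint the kernel,
// then read in large chunks. The file name and 1 MiB chunk are placeholders.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main() {
    int fd = open("input.bin", O_RDONLY);
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);   // whole file, sequential

    std::vector<char> chunk(1 << 20);                 // the one knob to tune
    ssize_t n;
    while ((n = read(fd, chunk.data(), chunk.size())) > 0) {
        // ... process n bytes of chunk ...
    }
    close(fd);
}
```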
For Windows, you'll want to make sure you use the FILE_FLAG_SEQUENTIAL_SCAN flag in your CreateFile() call, if you opt to use the platform-specific Windows API. This will optimize caching for the I/O. As far as buffer sizes go, a buffer size that is a multiple of the disk sector size is typically advised. 8K is a nice starting point, with little to be gained from going larger.
This article discusses the comparison between async and sync on Windows.
http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx
As you noted above it all depends on the machine / system / libraries that you are using. A fast solution on one system may be slow on another.
A general guideline, though, would be to write in as large chunks as possible. Typically, writing a byte at a time is the slowest.
The best way to know for sure is to code a few different ways and profile them.
You asked about C++, but it sounds like you're past that and ready to get a little platform-specific.
On Windows, FILE_FLAG_SEQUENTIAL_SCAN with a file mapping is probably the fastest way. In fact, your process can exit before the file actually makes it on to the disk. Without an explicitly-blocking flush operation, it can take up to 5 minutes for Windows to begin writing those pages.
You need to be careful if the files are not on local devices but a network drive. Network errors will show up as SEH errors, which you will need to be prepared to handle.
On *nixes, you might get a bit higher performance writing sequentially to a raw disk device. This is possible on Windows too, but not as well supported by the APIs. This will avoid a little filesystem overhead, but it may not amount to enough to be useful.
Loosely speaking, RAM is 1000 or more times faster than disks, and CPU is faster still. There are probably not a lot of logical optimizations that will help, except avoiding movements of the disk heads (seek) whenever possible. A dedicated disk just for this file can help significantly here.
You will get the absolute fastest performance by using CreateFile and ReadFile. Open the file with FILE_FLAG_SEQUENTIAL_SCAN.
Read with a buffer size that is a power of two. Only benchmarking can determine this number. I have seen it to be 8K once. Another time I found it to be 8M! This varies wildly.
It depends on the size of the CPU cache, on the efficiency of OS read-ahead and on the overhead associated with doing many small writes.
Memory mapping is not the fastest way. It has more overhead because you can't control the block size and the OS needs to fault in all pages.
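For what the CreateFile/ReadFile approach looks like in practice, here is a bare sketch with FILE_FLAG_SEQUENTIAL_SCAN; the file name is a placeholder, and the 8K buffer is only one of the candidate sizes you would benchmark as described above:

```cpp
// Bare sketch of CreateFile/ReadFile with FILE_FLAG_SEQUENTIAL_SCAN on
// Windows. The file name is a placeholder; the 8K buffer is only one of the
// candidate sizes to benchmark.
#include <windows.h>
#include <vector>

int main() {
    HANDLE h = CreateFileA("input.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    if (h == INVALID_HANDLE_VALUE) return 1;

    std::vector<char> buffer(8 * 1024);               // power-of-two buffer
    DWORD bytesRead = 0;
    while (ReadFile(h, buffer.data(), static_cast<DWORD>(buffer.size()),
                    &bytesRead, nullptr) && bytesRead > 0) {
        // ... consume bytesRead bytes from buffer ...
    }
    CloseHandle(h);
}
```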
On Linux, buffered reads and writes speed things up a lot, increasingly so with increasing buffer sizes, but the returns are diminishing and you generally want to use BUFSIZ (defined by stdio.h), as larger buffer sizes won't help much.
mmaping provides the fastest access to files, but the mmap call itself is rather expensive. For small files (16KiB) read and write system calls win (see https://stackoverflow.com/a/39196499/1084774 for the numbers on reading through read and mmap).
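If you want to experiment with the stdio buffer size mentioned above, setvbuf is one way to control it; a minimal sketch (the file name is a placeholder, and setvbuf must be called before the first read on the stream):

```cpp
// Minimal sketch: controlling the stdio buffer size with setvbuf (it must be
// called before the first read on the stream). The file name is a placeholder.
#include <cstdio>
#include <vector>

int main() {
    std::FILE* f = std::fopen("input.bin", "rb");
    std::vector<char> iobuf(BUFSIZ);                  // try BUFSIZ, 64 KiB, ...
    std::setvbuf(f, iobuf.data(), _IOFBF, iobuf.size());

    std::vector<char> chunk(4096);
    size_t n;
    while ((n = std::fread(chunk.data(), 1, chunk.size(), f)) > 0) {
        // ... process n bytes of chunk ...
    }
    std::fclose(f);
}
```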

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started writing the results straight away in files, one at a time, which was the slowest option. I figured it would get a lot faster if I built up a vector of a certain number of the files and then wrote them all at once, then went back to processing while the hard disk is occupied writing all that stuff I poured into it (that at least seems to be what happens).
My question is, can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints ? To me it seems to be a hard disk buffer thing, I have 16MB buffer on that hard disk and get these values (all for ~100000 files):
Buffer size time (minutes)
------------------------------
no Buffer ~ 8:30
1 MB ~ 6:15
10 MB ~ 5:45
50 MB ~ 7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how writing performance is to be optimized in general, for example are larger hard disk blocks helpful, etc.
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center): a WD 3.5" 1TB/7200/16MB/USB2, HFS+ journalled; the OS is MacOS 10.5. I'll soon give it a try on Ext3/Linux and an internal disk rather than an external one.
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
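In that spirit, a tiny self-calibration sketch that times a few chunk sizes and keeps the fastest might look like this (the sizes and the test path are placeholders; note that OS write caching will skew the numbers unless you also flush/sync, and a fuller harness would vary thread count as well):

```cpp
// Tiny self-calibration sketch: time a few chunk sizes against the target
// filesystem and keep the fastest. Sizes and the test path are placeholders;
// OS write caching will skew the numbers unless the file is also synced.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t total = 64 << 20;                    // 64 MiB test payload
    const size_t candidates[] = {4096, 64 << 10, 1 << 20, 8 << 20};
    std::vector<char> data(total, 'x');

    size_t best = 0;
    double best_time = 1e30;
    for (size_t chunk : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        std::FILE* f = std::fopen("bench.tmp", "wb");
        for (size_t off = 0; off < total; off += chunk)
            std::fwrite(data.data() + off, 1, chunk, f);
        std::fclose(f);
        double s = std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - t0).count();
        if (s < best_time) { best_time = s; best = chunk; }
    }
    std::printf("fastest chunk size: %zu bytes (%.2f s)\n", best, best_time);
    std::remove("bench.tmp");
}
```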
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. OSes (at least Windows) are already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't too much they can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. Writing 100,000s of files is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100,000 files. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double-buffer per helper. (One buffer owned by the write process and one by the read process.) The read thread fills the buffer until a specified limit, then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit. If writing went faster than last time, it increases the buffer by X%; if it went slower, it adjusts by -X%. X can be small.