Best way to read/write file in multithreaded environment (C++) - c++

I have a multithreaded program that reads and writes files. One thread receives data and writes it to a file. Every 250 MB of data, a new file is created. Multiple other threads can read from these files to retrieve data. I'm using C++ standard file streams.
To prevent problems, my current implementation uses two file descriptors for the same file: one for readers and one for the writer. A mutex protects from multiple access at the same time, and the file descriptor position is moved each time the mutex owner needs it.
I really need to be able to read in the file as fast as possible, and the mutex doesn't really help me.
Firstly, I would like to know if it's safe to read and write the file or have multiple reads at the same time (on every platform).
Secondly, if yes, I would like to know how safe it is for the hardware, such as the disk read-and-write head of an HDD. The software works on the disk all the time to save data, and I don't want my algorithm to shorten the hard disk's lifetime (already short) too much.
Thank you for your help

There is no problem regarding multiple threads reading the same file.
Now, if I understood your description correctly, you do not modify already-written data, you just continuously append data to your file until it reaches 250Mb, then you continue writing on a new file.
If this is the case, you may not need a mutex at all. For instance, you might be able to keep your whole "file" in memory until it reaches 250 MB, and only then write it all to disk, so you know that any files already on disk aren't going to be written anymore and can be read freely with no worries. As for the file that is still being written, you can have a global counter that holds how many bytes (or strings or whatever you use) have already been written, and reading threads are limited by this counter. It does not need a lock, as long as you only update it after you have already written the data (since you said there is only one thread writing data). In C++, though, make it a std::atomic rather than a plain integer: an unsynchronized read of a plain integer that another thread writes is formally a data race.
Reading the atomic counter cannot corrupt it, even when it is read by multiple threads at the same time and written by a single one, so this will ensure your reader threads will not read beyond the limit, and that limit will always be safe and consistent, while the writer thread can peacefully write data in an area that is guaranteed not to be bothered by reader threads until it is finished.
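The counter idea above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the class name AppendLog and its methods are hypothetical, it assumes exactly one writer thread, and it assumes a fixed-capacity in-memory buffer that is never reallocated. The writer copies the bytes first and only then publishes the new size with a release store; readers load the size with acquire and never look past it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical append-only buffer: ONE writer thread, any number of readers.
class AppendLog {
public:
    explicit AppendLog(std::size_t capacity) : data_(capacity), published_(0) {}

    // Writer thread only. Returns false when the buffer is full, which is
    // where the caller would flush it to disk and start a new one.
    bool append(const char* src, std::size_t n) {
        const std::size_t end = published_.load(std::memory_order_relaxed);
        if (n > data_.size() - end) return false;
        std::memcpy(data_.data() + end, src, n);               // 1. write the bytes
        published_.store(end + n, std::memory_order_release);  // 2. then publish
        return true;
    }

    // Any thread. Copies up to n bytes starting at offset, never reading past
    // the published size. Returns the number of bytes copied.
    std::size_t read(std::size_t offset, char* dst, std::size_t n) const {
        const std::size_t limit = published_.load(std::memory_order_acquire);
        if (offset >= limit) return 0;
        const std::size_t count = (n < limit - offset) ? n : limit - offset;
        std::memcpy(dst, data_.data() + offset, count);
        return count;
    }

    std::size_t size() const { return published_.load(std::memory_order_acquire); }

private:
    std::vector<char> data_;              // fixed capacity, never reallocated
    std::atomic<std::size_t> published_;  // bytes that readers may safely see
};
```

The release/acquire pair is what makes the "update the counter only after writing" rule hold across threads: a reader that observes the new size is guaranteed to also observe the bytes written before it.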
As for your second question, if you are indeed able to keep the currently-being-written file fully in memory, that will already save up some HDD usage, as well as time. Additionally, keep in mind most modern HDDs have 32Mb+ of cache, so it is not like every read and write will be directly hitting the HDD itself, unless you have a ton of threads reading random files and random parts of them all the time. If that is the case, there is probably not much you can do to help the HDD. And if that's not the case, there is not much to worry about, as the OS and the caches will do what they were meant to do :)

Related

Thread-safe file updates

I need to learn how to update a file concurrently without blocking other threads. Let me explain how it should work, needs, and how I think it should be implemented, then I ask my questions:
Here is how the worker works:
Worker is multithreaded.
There is one very large file (6 Terabyte).
Each thread is updating part of this file.
Each write is equal to one or more disk blocks (4096 bytes).
No two worker write at same block (or same group of blocks) at the same time.
Needs:
Threads should not block other blocks (no lock on file, or minimum possible number of locks should be used)
In case of (any kind of) failure, There is no problem if updating block corrupts.
In case of (any kind of) failure, blocks that are not updating should not corrupts.
If file write was successful, we must be sure that it is not buffered and be sure that actually written on disk (fsync)
I can convert this large file to as many smaller files as needed (down to 4kb files), but I prefer not to do that. Handling that many files is difficult, and needs a lot of file handles open/close operations, which has negative impact on performance.
How I think it should be implemented:
I'm not much familiar with file manipulation and how it works at operating system level, but I think writing on a single block should not corrupt other blocks when errors happen. So I think this code should perfectly work as needed, without any change:
char write_value[] = "...4096 bytes of data...";
int write_block = 12345;
int block_size = 4096;
FILE *fp;
fp = fopen("file.txt","w+");
fseek(fp, write_block * block_size, SEEK_SET);
fputs(write_value, fp);
fsync(fp);
fclose(fp);
Questions:
Obviously, I'm trying to understand how it should be implemented. So any suggestions are welcome. Specially:
If writing to one block of a large file fails, what is the chance of corrupting other blocks of data?
In short, What things should be considered on perfecting code above, (according to the last question)?
Is it possible to replace one block of data with another file/block atomically? (like how rename() system call replaces one file with another atomically, but in block-level. Something like replacing next-block-address of previous block in file system or whatever else).
Any device/file system/operating system specific notes? (This code will run on CentOS/FreeBSD (not decided yet), but I can change the OS if there is better alternative for this problem. File is on one 8TB SSD).
Threads should not block other blocks (no lock on file, or minimum possible number of locks should be used)
Your code sample uses fseek followed by a write (fputs). Without locking in between those two calls, you have a race condition, because another thread could move the file position in between. There are three reasonable solutions:
Use flockfile, followed by regular fseek and fwrite, then funlockfile. flockfile and funlockfile are POSIX.1-2001 standard (fwrite_unlocked is a glibc extension, not POSIX)
Use separate file handles per thread
Use pread and pwrite to do IO without having to worry about the seek position
Option 3 is the best for you.
You could also use the asynchronous IO from <aio.h> to handle the multithreading. It basically works with a thread-pool calling pwrite on most Unix implementations.
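A rough sketch of option 3, assuming a POSIX system and the 4096-byte block size from the question. The function names write_block and read_block are made up for illustration. Because pwrite and pread take the offset as an argument, they never touch the shared file position, so several threads can use the same descriptor as long as they target different blocks.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

constexpr std::size_t kBlockSize = 4096;  // block size taken from the question

// Writes exactly one block at the given block index, then asks the kernel to
// flush the file to stable storage. Hypothetical helper, not a library call.
bool write_block(int fd, off_t block_index, const char* block) {
    assert(block_index >= 0);
    const off_t base = block_index * static_cast<off_t>(kBlockSize);
    std::size_t done = 0;
    while (done < kBlockSize) {  // pwrite may write fewer bytes than requested
        const ssize_t n = pwrite(fd, block + done, kBlockSize - done,
                                 base + static_cast<off_t>(done));
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, retry
            return false;
        }
        done += static_cast<std::size_t>(n);
    }
    return fsync(fd) == 0;
}

// Reads exactly one block. Returns false on error or if the block lies
// (partly) beyond the end of the file.
bool read_block(int fd, off_t block_index, char* block) {
    assert(block_index >= 0);
    const off_t base = block_index * static_cast<off_t>(kBlockSize);
    std::size_t done = 0;
    while (done < kBlockSize) {
        const ssize_t n = pread(fd, block + done, kBlockSize - done,
                                base + static_cast<off_t>(done));
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        if (n == 0) return false;  // end of file
        done += static_cast<std::size_t>(n);
    }
    return true;
}
```

Note that fsync here flushes the whole file, not just the one block, so calling it after every block write will be slow; that cost is part of the durability requirement rather than of pwrite itself.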
In case of (any kind of) failure, There is no problem if updating block corrupts
I understand this to mean that there should be no file corruption in any failure state. To the best of my knowledge, that is not possible when you overwrite data. When the system fails in the middle of a write command, there is no way to guarantee how many bytes were written, at least not in a file-system agnostic version.
What you can do instead is similar to a database transaction: You write the new content to a new location in the file. Then you do an fsync to ensure it is on disk. Then you overwrite a header to point to the new location. If you crash before the header is written, your crash recovery will see the old content. If the header gets written, you see the new content. However, I'm not an expert in this field. That final header update is a bit of a hand-wave.
In case of (any kind of) failure, blocks that are not updating should not corrupts.
Should be fine
If file write was successful, we must be sure that it is not buffered and be sure that actually written on disk (fsync)
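Your sample code called fsync but forgot fflush before it (alternatively, set the stream to unbuffered using setvbuf). Note also that fsync takes a file descriptor, not a FILE*, so the call would have to be fsync(fileno(fp)).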
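For stdio streams, the two-step flush might look like this. It is a small sketch assuming POSIX (fileno and fsync), and the helper name flush_to_disk is hypothetical.

```cpp
#include <cstdio>
#include <unistd.h>

// Pushes stdio's user-space buffer to the kernel, then asks the kernel to
// commit the file to the storage device. True only if both steps succeed.
bool flush_to_disk(std::FILE* fp) {
    if (std::fflush(fp) != 0) return false;  // user-space buffer -> kernel
    return fsync(fileno(fp)) == 0;           // kernel page cache -> device
}
```

Calling fsync alone would leave any data still sitting in the FILE buffer unwritten, which is why the order matters.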
I can convert this large file to as many smaller files as needed (down to 4kb files), but I prefer not to do that. Handling that many files is difficult, and needs a lot of file handles open/close operations, which has negative impact on performance.
Many calls to fsync will kill your performance anyway. Short of reimplementing database transactions, this seems to be your best bet to achieve maximum crash recovery. The pattern is well documented and understood:
Create a new temporary file on the same file system as the data you want to overwrite
Read-Copy-Update the old content to the new temporary file
Call fsync
Rename the new file to the old file
The renaming on a single file system is atomic. Therefore this procedure will ensure after a crash, you either get the old data or the new one.
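The temporary-file pattern above could be sketched as follows on a POSIX system. The function name atomic_replace and the ".tmp" suffix are illustrative assumptions, and for simplicity the caller passes the complete new contents instead of doing the read-copy-update step inside the function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Replaces the file at `path` with `contents` so that a reader sees either
// the complete old file or the complete new one, never a mix.
bool atomic_replace(const std::string& path, const std::string& contents) {
    // Hypothetical naming scheme: same directory, hence same file system.
    const std::string tmp = path + ".tmp";
    const int fd = open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;

    bool ok = true;
    std::size_t done = 0;
    while (ok && done < contents.size()) {  // write may be partial
        const ssize_t n = write(fd, contents.data() + done, contents.size() - done);
        if (n < 0) ok = false;
        else done += static_cast<std::size_t>(n);
    }
    ok = ok && fsync(fd) == 0;    // data must be on disk before the rename
    ok = (close(fd) == 0) && ok;
    if (!ok || std::rename(tmp.c_str(), path.c_str()) != 0) {
        unlink(tmp.c_str());      // clean up the temporary file on failure
        return false;
    }
    return true;
}
```

A stricter version would probably also fsync the containing directory after the rename, so that the rename itself survives a crash; that step is left out here to keep the sketch short.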
If writing to one block of a large file fails, what is the chance of corrupting other blocks of data?
None.
Is it possible to replace one block of data with another file/block atomically? (like how rename() system call replaces one file with another atomically, but in block-level. Something like replacing next-block-address of previous block in file system or whatever else).
No.

Memory mapped IO concept details

I'm attempting to figure out what the best way is to write files in Windows. For that, I've been running some tests with memory mapping, in an attempt to figure out what is happening and how I should organize things...
Scenario: The file is intended to be used in a single process, in multiple threads. You should see a thread as a worker that works on the file storage; some of them will read, some will write - and in some cases the file will grow. I want my state to survive both process and OS crashes. Files can be large, say: 1 TB.
After reading a lot on MSDN, I whipped up a small test case. What I basically do is the following:
Open a file (CreateFile) using FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH.
Build a mmap file handle (CreateFileMapping) on the file, using some file growth mechanism.
Map the memory regions (MapViewOfFile) using a multiple of the sector size (from STORAGE_PROPERTY_QUERY). The mode I intend to use is READ+WRITE.
So far I've been unable to figure out how to use these construct exactly (tools like diskmon won't work for good reasons) so I decided to ask here. What I basically want to know is: how I can best use these constructs for my scenario?
If I understand correctly, this is more or less the correct approach; however, I'm unsure as to the exact role of CreateFileMapping vs MapViewOfFile and if this will work in multiple threads (e.g. the way writes are ordered when they are flushed to disk).
I intend to open the file once per process as per (1).
Per thread, I intend to create a mmap file handle as per (2) for the entire file. If I need to grow the file, I will estimate how much space I need, close the handle and reopen it using CreateFileMapping.
While the worker is doing its thing, it needs pieces of the file. So, I intend to use MapViewOfFile (which seems limited to 2 GB) for each piece, process it and unmap it again.
Questions:
Do I understand the concepts correctly?
When is data physically read and written to disk? So, when I have a loop that writes 1 MB of data in (3), will it write that data after the unmap call? Or will it write data the moment I hit memory in another page? (After all, disks are block devices so at some point we have to write a block...)
Will this work in multiple threads? This is about the calls themselves - I'm not sure if they will error if you have -say- 100 workers.
I do understand that (written) data is immediately available in other threads (unless it's a remote file), which means I should be careful with read/write concurrency. If I intend to write stuff, and afterwards update a (single-physical-block) header (indicating that readers should use another pointer from now on), is it then guaranteed that the data is written prior to the header?
Will it matter if I use 1 file or multiple files (assuming they're on the same physical device of course)?
Memory mapped files generally work best for READING; not writing. The problem you face is that you have to know the size of the file before you do the mapping.
You say:
in some cases the file will grow
Which really rules out a memory mapped file.
When you create a memory mapped file on Windows, you are creating your own page file and mapping a range of memory to that page file. This tends to be the fastest way to read binary data, especially if the file is contiguous.
For writing, memory mapped files are problematic.

Multithreaded Files Reading

I need to read / parse a large binary file (4 ~ 6 GB) that comes in fixed chunks of 8192 bytes. My current solution involves streaming the file chunks using the Single Producer Multiple Consumer (SPMC) pattern.
EDIT
File size = N * 8192 Bytes
All I am required to do is to do something to each of these 8192 bytes. The file is only required to be read once top down.
Having thought that this should be an embarrassingly parallel problem, I would like to have X threads to read at equal ranges of (File Size / X) sizes independently. The threads do not need to communicate with each other at all.
I've tried spawning X threads to open the same file and seek to their respective sections to process. However, this solution seems to suffer from HDD mechanical seeks and apparently performs worse than the SPMC solution.
Would there be any difference if this method is used on the SSD instead?
Or would it be more straightforward to just memory map the whole file and use #pragma omp parallel for to process the chunks? I suppose I would need enough RAM to do this?
What would you suggest?
What would you suggest?
Don't use mmap()
Per Linus Torvalds himself:
People love mmap() and other ways to play with the page tables to
optimize away a copy operation, and sometimes it is worth it.
HOWEVER, playing games with the virtual memory mapping is very
expensive in itself. It has a number of quite real disadvantages that
people tend to ignore because memory copying is seen as something very
slow, and sometimes optimizing that copy away is seen as an obvious
improvment.
Downsides to mmap:
quite noticeable setup and teardown costs. And I mean noticeable.
It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the
mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Upsides of mmap:
if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread.
This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it's just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.
if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again.
And the automatic sharing is obviously a case of this.
But your test-suite (just copying the data once) is probably pessimal
for mmap().
Note the last - just using the data once is a bad use-case for mmap().
For a file on an SSD, since there are no physical head seek movements:
Open the file once, using open() to get a single int file descriptor.
Use pread() per thread to read appropriate 8kB chunks. pread() reads from a specified offset without using lseek(), and does not affect the current offset of the file being read from.
You'll probably need somewhat more threads than CPU cores, since there's going to be significant IO waiting on each thread.
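The SSD steps above might be sketched like this. It assumes POSIX pread and the 8192-byte chunks from the question. process_chunk is a stand-in for the real per-chunk work (here it only sums bytes), and parallel_process is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/types.h>
#include <thread>
#include <unistd.h>
#include <vector>

constexpr std::size_t kChunk = 8192;  // chunk size from the question

// Placeholder for the real per-chunk work: just sums the bytes.
std::uint64_t process_chunk(const unsigned char* p, std::size_t n) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += p[i];
    return sum;
}

// Splits the chunks into contiguous ranges, one per thread. Every thread
// reads its own range with pread on the single shared descriptor.
std::uint64_t parallel_process(int fd, std::size_t chunk_count, unsigned thread_count) {
    assert(thread_count > 0);
    std::vector<std::uint64_t> partial(thread_count, 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < thread_count; ++t) {
        threads.emplace_back([&partial, fd, chunk_count, thread_count, t] {
            const std::size_t begin = chunk_count * t / thread_count;
            const std::size_t end = chunk_count * (t + 1) / thread_count;
            std::vector<unsigned char> buf(kChunk);  // private buffer per thread
            for (std::size_t c = begin; c < end; ++c) {
                const ssize_t n = pread(fd, buf.data(), kChunk,
                                        static_cast<off_t>(c * kChunk));
                if (n <= 0) break;  // error or unexpected EOF; real code should report it
                partial[t] += process_chunk(buf.data(), static_cast<std::size_t>(n));
            }
        });
    }
    for (auto& th : threads) th.join();
    std::uint64_t total = 0;
    for (std::uint64_t v : partial) total += v;
    return total;
}
```

The sketch treats a short read as a full chunk's worth of whatever was returned; for a regular file whose size is a multiple of 8192 that should not occur in practice, but production code would want to loop until the chunk is complete.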
For a file on mechanical disk(s):
You want to minimize head seek(s) on the mechanical disk.
Open the file once, using open() with direct IO (assuming Linux, open( filename, O_RDONLY | O_DIRECT );) to bypass the page cache (since you're going to stream the file and never re-read any portion of it, the page cache does you no good here)
Using a single producer thread, read large chunks (say 64k to 1MB+)
into one of N page-aligned buffers.
When a buffer is read, pass it to the worker threads, then read to fill the next buffer
When all workers are done using their part of the buffer, pass the
buffer back to the reading thread.
You'll need to experiment with the proper read() size, the number of worker threads, and the number of buffers passed around. Larger read()s will be more efficient, but the larger buffer size makes the memory requirements larger and makes the latency of getting that buffer back from the worker threads much more unpredictable. You want to make as few copies of the data as possible, so you'd want the worker threads to work directly on the buffer read from the file.
Even if the processing of each 8K block is significant (short of OCR processing), the i/o is the bottleneck. Unless it can be arranged for parts of the file to be already cached by previous operations....
If the system this is to run on can be dedicated to the problem:
Obtain the file size (fstat)
Allocate a buffer that size.
Open and read the whole file into the buffer.
Figure out how to partition the data per thread and spin off the threads.
Time that algorithm.
Then, revise it using asynchronous reading. See man aio_read and man 7 aio to learn what needs to be done.

Thread Optimization [duplicate]

I have an input file in my application that contains a vast amount of information. Reading over it sequentially, and at only a single file offset at a time, is not sufficient for my application's usage. Ideally, I'd like to have two threads that have separate and distinct ifstreams reading from two unique file offsets of the same file. I can't just start one ifstream up and then make a copy of it using its copy constructor (since it's uncopyable). So, how do I handle this?
Immediately I can think of two ways,
Construct a new ifstream for the second thread, open it on the same file.
Share a single instance of an open ifstream across both threads (using for instance boost::shared_ptr<>). Seek to the appropriate file offset that current thread is currently interested in, when the thread gets a time slice.
Is one of these two methods preferred?
Is there a third (or fourth) option that I have not yet thought of?
Obviously I am ultimately limited by the hard drive having to spin back and forth, but what I am interested in taking advantage of (if possible), is some OS level disk caching at both file offsets simultaneously.
Thanks.
Two std::ifstream instances will probably be the best option here. Modern HDDs are optimized for a large queue of I/O requests, so reading from two std::ifstream instances concurrently should give quite nice performance.
If you have a single std::ifstream you'll have to worry about synchronizing access to it, plus it might defeat the operating system's automatic sequential access read-ahead caching, resulting in poorer performance.
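A minimal sketch of the two-stream approach: each thread constructs its own std::ifstream on the same path, so no stream state is shared and no locking is needed. The helper names read_range and read_both_halves are hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>

// Reads up to `count` bytes starting at `offset`, using an ifstream that is
// created and used only by the calling thread.
std::string read_range(const std::string& path, std::streamoff offset, std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    std::string out(count, '\0');
    in.seekg(offset);
    in.read(&out[0], static_cast<std::streamsize>(count));
    out.resize(static_cast<std::size_t>(in.gcount()));  // keep only what was read
    return out;
}

// Reads the two halves of a file of known `size` concurrently.
void read_both_halves(const std::string& path, std::size_t size,
                      std::string& first, std::string& second) {
    const std::size_t half = size / 2;
    std::thread a([&] { first = read_range(path, 0, half); });
    std::thread b([&] { second = read_range(path, static_cast<std::streamoff>(half), size - half); });
    a.join();
    b.join();
}
```

Whether this is actually faster than one sequential reader depends on the storage device, so it is worth timing both on the target system.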
Between the two, I would prefer the second. Having two openings of the same file might cause an inconsistent view between the files, depending on the underlying OS.
For a third option, pass a reference or raw pointer into the other thread. So long as the semantics are that one thread "owns" the istream, the raw pointer or reference are fine.
Finally note that on the vast majority of hardware, the disk is the bottleneck, not CPU, when loading large files. Using two threads will make this worse because you're turning a sequential file access into a random access. Typical hard disks can do maybe 100MB/s sequentially, but top out at 3 or 4 MB/s random access.
Other option:
Memory-map the file, create as many memory istream objects as you want. (istrstream is good for this, istringstream is not).
It really depends on your system. A modern system will generally read
ahead; seeking within the file is likely to inhibit this, so should
definitely be avoided.
It might be worth experimenting how read-ahead works on your system:
open the file, then read the first half of it sequentially, and see how
long that takes. Then open it, seek to the middle, and read the second
half sequentially. (On some systems I've seen in the past, a simple
seek, at any time, will turn off read-ahead.) Finally, open it, then
read every other record; this will simulate two threads using the same
file descriptor. (For all of these tests, use fixed length records, and
open in binary mode. Also take whatever steps are necessary to ensure
that any data from the file is purged from the OS's cache before
starting the test—under Unix, copying a file of 10 or 20 Gigabytes
to /dev/null is usually sufficient for this.)
That will give you some ideas, but to be really certain, the best
solution would be to test the real cases. I'd be surprised if sharing a
single ifstream (and thus a single file descriptor), and constantly
seeking, won, but you never know.
I'd also recommend system specific solutions like mmap, but if you've
got that much data, there's a good chance you won't be able to map it
all in one go anyway. (You can still use mmap, mapping sections of it
at a time, but it becomes a lot more complicated.)
Finally, would it be possible to get the data already cut up into
smaller files? That might be the fastest solution of all. (Ideally,
this would be done where the data is generated or imported into the
system.)
My vote would be a single reader, which hands the data to multiple worker threads.
If your file is on a single disk, then multiple readers will kill your read performance. Yes, your kernel may have some fantastic caching or queuing capabilities, but it is going to be spending more time seeking than reading data.

Performance when reading a file line by line vs reading the whole file

Is there a noticeable difference (in theory) when reading a file line by line compared to reading the whole file in one go?
Reading the whole file does have a negative impact on the amount of memory used but does it work faster?
I need to read a file and process each line. I don't know whether I should read one line at a time and process it, or read the whole file, process all, then write to output.
I've already set up the program to read line by line and I want to know whether it is worth the effort to change it to read the whole file (not easy given my setup).
Thanks,
Reading the whole file will be slightly faster -- but not much!
But be careful: reading the whole file is not scalable, as you are limited by the available memory in the system. Once the file size exceeds the RAM available to your program, it will start using swap space, which will be much slower. If the file size exceeds the available virtual memory, your program will crash.
I think it would depend on the needs of your application (like most things, I know). Reading a 1 MB file in Node.js is ~3-4x faster with fs.readFile() than using a readable stream or line reader, as far as just file reading goes. Streams may offer some additional performance if the file is very large and you are processing input on the fly. They may also be ideal if your application is already consuming a lot of memory, as a Node process has a default heap limit of roughly 1.5 GB on 64-bit systems (the exact default varies by version and can be raised with --max-old-space-size). Processing chunks as they come in may also be more performant if the source of the data is slow relative to how fast the CPU can process it (archives on HDD or tape, network connections like TCP). As for why reading a file into memory beats streaming it into memory, I am guessing that the function call overhead of emitting data events and switching to the processing callback slows down the process.
Like others, I believe doing bigger reads will improve the performance of your application some, but don't expect miracles, I/O is already buffered at the OS layer, so you'll only be gaining by reducing the overhead of having too many read calls. Reading the whole file in one go is dangerous, unless you know the maximum possible size for your input files. A most reasonable approach is to read the file in large blocks.
If you wanted to improve even more, you should consider overlapping the I/O with the processing. Let's say you read the input file in blocks of 128MB. On your main thread you read the first 128MB block and then pass it on to a worker thread for processing. While the worker thread gets to work the main thread reads the second 128MB block. From that point on, while the worker thread is processing block N, the main thread is reading block N+1 from disk.
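The overlap described above could be sketched with std::async roughly as follows. The function names are hypothetical, process_block is a stand-in for the real work (it only sums bytes), and the block size is a parameter rather than the 128MB mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Placeholder for the real per-block work: just sums the bytes.
std::uint64_t process_block(const std::vector<char>& block) {
    std::uint64_t sum = 0;
    for (char c : block) sum += static_cast<unsigned char>(c);
    return sum;
}

// Reads the next block (up to block_size bytes). An empty vector means EOF.
std::vector<char> read_next_block(std::ifstream& in, std::size_t block_size) {
    std::vector<char> buf(block_size);
    in.read(buf.data(), static_cast<std::streamsize>(block_size));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// While a worker thread processes block N, this thread reads block N+1.
std::uint64_t process_file_overlapped(const std::string& path, std::size_t block_size) {
    assert(block_size > 0);
    std::ifstream in(path, std::ios::binary);
    std::uint64_t total = 0;
    std::vector<char> current = read_next_block(in, block_size);
    while (!current.empty()) {
        auto pending = std::async(std::launch::async,
                                  [&current] { return process_block(current); });
        std::vector<char> next = read_next_block(in, block_size);  // overlaps with processing
        total += pending.get();     // wait for block N before reusing `current`
        current = std::move(next);
    }
    return total;
}
```

This keeps at most two blocks in memory at a time. With more than one worker the same idea generalizes to a small pool of buffers, at the cost of more bookkeeping.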
Reading the entire file into memory is generally not a good idea because the files can be huge and may take up a lot of memory and in worst case run out of memory. So, to balance performance and memory usage, you read a block of file into a buffer and parse through the buffer. When you are done processing the block, read the next block until EOF.
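For line-oriented input, reading in blocks means a line can straddle two blocks, so the parser has to carry the unfinished tail over to the next read. A small sketch under that assumption, with the hypothetical function name for_each_line_blockwise:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Reads `path` in blocks of `block_size` bytes and calls `on_line` once per
// line (without the trailing '\n'). Returns the number of lines seen.
std::size_t for_each_line_blockwise(const std::string& path, std::size_t block_size,
                                    const std::function<void(const std::string&)>& on_line) {
    assert(block_size > 0);
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(block_size);
    std::string carry;  // unfinished line left over from the previous block
    std::size_t lines = 0;
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        std::size_t start = 0;
        for (std::size_t i = 0; i < got; ++i) {
            if (buf[i] == '\n') {
                carry.append(buf.data() + start, i - start);
                on_line(carry);
                ++lines;
                carry.clear();
                start = i + 1;
            }
        }
        carry.append(buf.data() + start, got - start);  // keep the partial line
    }
    if (!carry.empty()) {  // last line without a trailing newline
        on_line(carry);
        ++lines;
    }
    return lines;
}
```

Memory use is bounded by the block size plus the longest line, regardless of how large the file is.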
Deciding on a good block size will have to be done based on what you want to achieve.
To be honest, after studying efficiency for a while during my degree, I came to this conclusion about your question: it depends on how often this file is going to be read. If you are reading it once, then read the whole thing, because that frees the process for other tasks.
One more thing to keep in mind: is the file going to be edited later and require updates (as in, reading only the updated part)? If so, you might need to set a marker to recognise where to read from (and then again, how often is it updated?). But yes, if it is a one-time job, go ahead and read it as a whole, as long as you do not need to create tokens from certain literals in the file.
hope this helps.
One factor is how much data you are going to be reading, and so how long the program initially takes to run, i.e. whether there's any benefit in working on the performance.
See the book quotes in this answer for some good, general advice on thinking about software performance.
(I know you're asking for an answer in theory, but this aspect of when to worry about performance is also important whenever you have a finite amount of time to spend.)