I have a C++ application that handles multiple connections and writes to a file. When using the function _sopen_s with the parameter _SH_DENYNO, all threads working simultaneously write to the file and I see no loss of data. Can you tell me how access to the file is managed by the function so that there is no loss of data?
Kind regards
If you're using write(), or some OS-supplied variant of it, individual write() calls tend to be implemented such that each call to write() is atomic because of the implications of this POSIX statement regarding write():
On a regular file or other file capable of seeking, the actual writing of data shall proceed from the position in the file indicated by the file offset associated with fildes. Before successful return from write(), the file offset shall be incremented by the number of bytes actually written. On a regular file, if this incremented file offset is greater than the length of the file, the length of the file shall be set to this file offset.
That's somewhat hard to implement in a way that allows for interleaved data from separate write() calls on the same file descriptor, because if data is interleaved the change in the file offset upon completion wouldn't be equal to the "number of bytes actually written". So that statement could be interpreted as an implied requirement for the atomicity of write() calls on any "regular file or other file capable of seeking".
Also, there's the explicit POSIX requirement that write() calls to pipes of less than or equal to PIPE_BUF bytes be atomic:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of POSIX.1-2008 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.
Since write() just gets an int for a file descriptor with no other information directly available as to what the file descriptor refers to, the simplest way to implement write() in a way that meets the POSIX requirement for atomic write() to a pipe is to make each write() call atomic.
So, while there's no requirement for atomicity unless you're writing less than or equal to PIPE_BUF bytes to an actual pipe, write() tends to be implemented atomically for everything.
Now, that doesn't mean that whatever the file descriptor points to won't break the data up or interleave it with other data. For example, I wouldn't be surprised at all to see interleaved data if two threads were each to try calling one write() operation to stream several GB of data from MPEG files simultaneously to the same TCP socket.
And you're not actually calling write(). But the underlying implementation is likely shared.
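As a rough illustration of that behaviour, here is a minimal sketch (assuming a POSIX system and a hypothetical log.txt file) in which two threads each emit whole records with a single write() call on a shared descriptor. The per-call atomicity described above is common in practice, but it is not something POSIX guarantees for regular files:
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <thread>

int main() {
    int fd = open("log.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    auto writer = [fd](char tag) {
        for (int i = 0; i < 1000; ++i) {
            std::string record(64, tag);   // one complete record per call
            record.back() = '\n';
            // Return value ignored for brevity; a real program checks for
            // short writes. If write() is atomic per call, records from the
            // two threads will not interleave within a line.
            write(fd, record.data(), record.size());
        }
    };

    std::thread a(writer, 'A'), b(writer, 'B');
    a.join();
    b.join();
    close(fd);
}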
My Operating Systems professor was talking today about how the read system call is unbuffered while the istream::read function has a buffer. This left me a bit confused, since you still create a buffer yourself when using istream::read.
The only thing I can think of is that there is more than one buffer involved in an istream::read call. Why?
What does the istream::read() function do differently from the read() function system call?
The professor was talking about buffers internal to the istream rather than the buffer provided by the calling code where the data ends up after the read.
As an example, say you are reading individual int objects out of an istream, the istream is likely to have an internal buffer where some number of bytes is stored and the next read can be satisfied out of that rather than going to the OS. Note, however, that whatever the istream is hooked to very likely has internal buffers as well. Most OSes have means to perform zero-copy reads (that is, read directly from the I/O source to your buffer), but that facility comes with severe restrictions (read size must be multiple of some particular number of bytes, and if reading from a disk file the file pointer must also be on a multiple of that byte count). Most of the time such zero-copy reads are not worth the hassle.
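As a hedged sketch of those two buffers, the snippet below reads ints out of a hypothetical data.bin file: the caller's buffer is the int variable passed to read(), while the filebuf's internal buffer (set here with pubsetbuf, which implementations are free to treat as a mere hint) is refilled from the OS only when it runs dry:
#include <fstream>
#include <iostream>

int main() {
    std::ifstream in;
    char internalBuf[4096];
    // Must be called before open() to have any effect on most implementations.
    in.rdbuf()->pubsetbuf(internalBuf, sizeof internalBuf);
    in.open("data.bin", std::ios::binary);

    int value;
    while (in.read(reinterpret_cast<char*>(&value), sizeof value)) {
        // Each read() is normally satisfied from internalBuf; the underlying
        // OS read happens only when the buffer is exhausted.
        std::cout << value << '\n';
    }
}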
Suppose a program has a caching mechanism where, at the end of some specific calculation, the program writes the output of that calculation to disk to avoid re-computing it later, when the program is re-run. It does so for a large number of calculations, and saves each output to a separate file (one per calculation, with filenames determined by hashing the computation parameters). The data is written to the file with standard C++ streams:
void* data = /* result of computation */;
std::size_t dataSize = /* size of the result in bytes */;
std::string cacheFile = /* unique filename for this computation */;
std::ofstream out(cacheFile, std::ios::binary);
out << dataSize;
out.write(static_cast<const char *>(data), dataSize);
The calculation is deterministic, hence the data written to a given file will always be the same.
Question: is it safe for multiple threads (or processes) to attempt this simultaneously, for the same calculation, and with the same output file? It does not matter if some threads or processes fail to write the file, as long as at least one succeeds, and as long as all programs are left in a valid state.
In the manual tests I ran, no program failure or data corruption occurred, and the file was always created with the correct content, but this may be platform-dependent. For reference, in our specific case, the size of the data ranges from 2 to 50 kilobytes.
is it safe for multiple threads (or processes) to attempt this simultaneously, for the same calculation, and with the same output file?
It is a race condition when multiple threads try to write to the same file, and you may end up with a corrupted file. There is no guarantee that ofstream::write is atomic, and its behaviour depends on the particular filesystem.
The robust solution to your problem (it works with multiple threads and/or processes):
Write to a temporary file with a unique name in the destination directory (so that the temporary and the final file are on the same filesystem and rename does not have to copy data).
rename the temporary file to its final name. rename replaces the existing file if one is there; the non-portable renameat2 is more flexible. A sketch of this approach follows below.
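A minimal sketch of that approach, reusing the variables from the question; the temporary-name scheme (the filename plus a thread-id suffix) is only illustrative, and a real implementation would also mix in the process id for cross-process uniqueness:
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>

void writeCacheFile(const std::string& cacheFile,
                    const void* data, std::size_t dataSize)
{
    std::ostringstream suffix;
    suffix << ".tmp." << std::this_thread::get_id();
    // Same directory as the final file, hence the same filesystem.
    const std::string tmpFile = cacheFile + suffix.str();

    {
        std::ofstream out(tmpFile, std::ios::binary);
        out << dataSize;
        out.write(static_cast<const char*>(data), dataSize);
    }   // closed, so fully written, before the rename

    // rename() atomically replaces any existing cacheFile, so every competing
    // writer ends up installing a complete, identical file.
    std::filesystem::rename(tmpFile, cacheFile);
}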
It is possible to synchronise writes to one file between threads within the same process using ordinary thread synchronisation. However, that doesn't work between different processes, so it is better to avoid relying on it; there isn't anything in the C++ standard library you can use for cross-process synchronisation.
Operating systems do provide special functions for locking files that are guaranteed to be atomic (like lockf on Linux or LockFile(Ex) on Windows). You might like to check them out.
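For instance, here is an illustrative sketch using POSIX lockf() (LockFileEx would play the same role on Windows). These are advisory locks, so they only help if every cooperating process takes them:
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

bool writeLocked(const char* path, const void* data, std::size_t size) {
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return false;

    if (lockf(fd, F_LOCK, 0) != 0) {        // blocks until we own the lock
        close(fd);
        return false;
    }
    ssize_t written = write(fd, data, size);
    lockf(fd, F_ULOCK, 0);                  // release the advisory lock
    close(fd);
    return written == static_cast<ssize_t>(size);
}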
Lots of people want to switch off buffering on their file descriptors. I want the reverse: I deliberately want to configure a file descriptor to buffer, say, 1K of data before writing to disk.
The reason is that I'm writing a unit test for a "flush" function of a C++ class. To test that it's working I want to write some data, check the size of the file on disk, then flush, then check that the size has grown. But in practice, by the time I do the first file size check the data has already been written.
Note that I'm working with a raw file descriptor here, not a stream or anything.
This is on Linux, if that matters.
How to force a file descriptor to buffer my output
If you're using the POSIX write() (or a variant of it), you can't.
The write() call must behave thus:
After a write() to a regular file has successfully returned:
Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.
Any subsequent successful write() to the same byte position in the file shall overwrite that file data.
Those requirements mean the data written is visible to any other process on the system, and to be consistent, if the data written causes the file size to grow, the file size reported by the kernel must reflect the data written.
I want to write some data, check the size of the file on disk, then flush, then check that the size has grown.
That fundamentally doesn't work with write(). The file size will grow as the data is written - write() does not buffer data.
If you want it to do that, you'll have to implement your own filesystem - one that isn't POSIX compliant.
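A small demonstration of that point, assuming a POSIX system: the size reported by fstat() grows as soon as write() returns, because there is no buffer at the descriptor level to flush:
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <iostream>

int main() {
    int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    write(fd, "hello", 5);
    struct stat st;
    fstat(fd, &st);
    std::cout << "size after write(): " << st.st_size << '\n';   // prints 5

    close(fd);
}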
Although I have read about buffers and streams and how they are used with files in C++, I don't know why a buffer is needed if a stream is there; the stream is always there to transfer the data of a file to the program. So why do we use buffers to store data (seemingly performing the same task a stream does), and what are buffered and unbuffered streams?
Consider a stream that writes to a file. If there were no buffer, then every time your program wrote a single byte to the stream, a single byte would have to be written to the file. That's very inefficient. So streams have buffers to decouple operations on one side of the stream from operations on the other side of the stream.
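A small sketch of that decoupling: the single-character writes below land in the ofstream's buffer and are handed to the operating system in larger chunks, here forced out explicitly with flush():
#include <fstream>

int main() {
    std::ofstream out("bytes.txt");
    for (int i = 0; i < 10000; ++i)
        out.put('x');      // buffered: typically no system call per character
    out.flush();           // the accumulated buffer is written out now
}                          // the destructor would also flush when out closes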
OK, let's start from scratch. Suppose you want to work with files. For this purpose you would have to manage how data gets into your file, whether sending data to the file succeeded or not, and all the other basic details. Now, either you can manage all of that on your own, which would take a lot of time and hard work, or you can use a stream.
Yes, you can use a stream for such purposes. Streams work through an abstraction mechanism, i.e. we C++ programmers don't know how they work internally; we only know that we stand at one side of the stream (our program's side), we hand our data to the stream, and it has the responsibility to transfer the data from one end to the other (the file's side).
For example:
ofstream file("abc.txt"); //Here an object of output file stream is created
file<<"Hello"; //We are just giving our data to stream and it transfers that
file.close(); //The closing of file
Now, if you work with files you should know that file access is a really expensive operation, i.e. it takes more time to access a file than to access memory, and we don't need to touch the file for every single access. So programmers use a feature called a buffer, which is a part of the computer's memory that stores data temporarily while handling files.
Suppose that, instead of reading the file every time you need data, you just read a memory location into which the file's data has been temporarily copied. That is a much cheaper operation, because you are reading memory, not the file.
Streams that use a buffer in their operation, i.e. they open the file and by default copy the file's data into the buffer, are called buffered streams, whereas streams that don't use any buffer are called unbuffered streams.
Now, if you write data to a buffered stream, that data is queued up until the stream is flushed (flushing means writing the buffer's contents out to the file). Unbuffered streams appear faster from the point of view of the user at one end of the stream, since data is not temporarily stored in a buffer but is sent to the file as soon as it reaches the stream.
A buffer and a stream are different concepts.
A buffer is a part of memory used to store data temporarily. It can be implemented and structured in various ways. For instance, if one wants to read a very large file, chunks of the file can be read and stored in the buffer. Once a certain chunk is processed, the data can be discarded and the next chunk can be read. A chunk in this case could be a line of the file.
Streams are the way C++ handles input and output. Their implementation uses buffers.
I do agree that streams are probably the most poorly written and the most badly understood part of the standard library. People use them every day and many of them don't have the slightest clue how the constructs they use actually work. For a little fun, try asking around what std::endl is - you might find that some of the answers are funny.
At any rate, streams and streambufs have different responsibilities. Streams are supposed to provide formatted input and output - that is, translate an integer to a sequence of bytes (or the other way around) - while buffers are responsible for conveying the sequence of bytes to the media.
Unfortunately, this design is not clear from the implementation. For instance, we have all those numerous streams - file stream and string stream, for example - while the only difference between them is the buffer; the stream code remains exactly the same. I believe many people would redesign streams if they had their way, but I am afraid this is not going to happen.
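To illustrate that division of responsibilities, here is a small sketch in which the same std::ostream formatting code runs on top of two different buffers, a filebuf and a stringbuf:
#include <fstream>
#include <iostream>
#include <sstream>

int main() {
    std::filebuf fileBuf;
    fileBuf.open("out.txt", std::ios::out);
    std::stringbuf stringBuf;

    std::ostream toFile(&fileBuf);      // the stream does the formatting...
    std::ostream toString(&stringBuf);  // ...the buffer decides where the bytes go

    toFile << "value = " << 42 << '\n';
    toString << "value = " << 42 << '\n';

    std::cout << stringBuf.str();       // same text, captured in memory
}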
I have a hypothetical scenario where a file handle is opened in asynchronous mode, and some threads are appending to that file handle. They append by setting the Offset and OffsetHigh members of the OVERLAPPED structure to 0xFFFFFFFF, as documented in the MSDN article for WriteFile.
Can I issue a second write in append mode like this before the first append completes, and expect the file to contain the entire contents of the first append followed by the entire contents of the second append? Or must I wait to issue the following asynchronous write until the previous write completes?
Yes, it works. I worked at a company that used a similar scheme, although to get their seek calls to work each time, they pre-created the file at a known size (about 2Gb...) and then truncated it at the end.
However, you can just "append" by seeking to the right location before each write. You'll have to track the position yourself, though.
Each thread must also access the file atomically, of course.
A simple example:
lock mutex
seek to position
write data
position += data size
unlock mutex
Of course here I assume that the file is properly opened before you call this function from any thread.
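As a hedged sketch of that scheme on Windows (the handle is assumed to have been opened for synchronous I/O, i.e. without FILE_FLAG_OVERLAPPED; appendRecord and the shared variables are illustrative names):
#include <windows.h>
#include <mutex>

std::mutex fileMutex;        // one mutex shared by all writer threads
LONGLONG position = 0;       // append offset maintained by the application

void appendRecord(HANDLE file, const void* data, DWORD size) {
    std::lock_guard<std::mutex> lock(fileMutex);        // lock mutex
    LARGE_INTEGER pos;
    pos.QuadPart = position;
    SetFilePointerEx(file, pos, nullptr, FILE_BEGIN);   // seek to position
    DWORD written = 0;
    WriteFile(file, data, size, &written, nullptr);     // write data
    position += written;                                // position += data size
}                                                       // unlock mutex (lock_guard)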
The one thing that you cannot do, unless you create a large file first (which is very fast, since files full of zeroes are created virtually), is to seek to a position that depends on something such as a frame number. So if thread 3 wants to write at "size * 3" and that happens before thread 2 writes at "size * 2", then the seek() will fail...
You should never issue multiple outstanding WriteFile operations with the offset set to 0xFFFFFFFF, even from a single thread. That causes multiple calls to try to use the end of the file at the same time and leads to data corruption: when WriteFile operates in asynchronous mode and there are other outstanding WriteFile operations in progress using the end of file, some operations may write data to the end of the file while other outstanding operations get a stale end-of-file pointer. In short, you should use 0xFFFFFFFF only once and wait for that operation to finish before issuing another one with that offset. Otherwise you need to calculate the offsets yourself so that each outstanding write operation uses a unique offset. This bug took me three days to find, thanks to the poor MSDN documentation about that offset.
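One hedged sketch of the "calculate the offsets yourself" approach: each writer reserves a unique range with an atomic counter and puts the explicit offset into its own OVERLAPPED, rather than using 0xFFFFFFFF. The handle is assumed to have been opened with FILE_FLAG_OVERLAPPED, and the caller must keep the OVERLAPPED and the data alive (and check for ERROR_IO_PENDING) until the completion is observed:
#include <windows.h>
#include <atomic>

std::atomic<ULONGLONG> nextOffset{0};    // next free position in the file

void appendAsync(HANDLE file, const void* data, DWORD size, OVERLAPPED* ov) {
    ULONGLONG offset = nextOffset.fetch_add(size);       // reserve a unique range
    ZeroMemory(ov, sizeof *ov);
    ov->Offset     = static_cast<DWORD>(offset & 0xFFFFFFFFu);
    ov->OffsetHigh = static_cast<DWORD>(offset >> 32);
    // May return FALSE with GetLastError() == ERROR_IO_PENDING; completion is
    // reported later via the event/IOCP associated with the handle.
    WriteFile(file, data, size, nullptr, ov);
}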