Lots of people want to switch off buffering on their file descriptors. I want the reverse: I deliberately want to configure a file descriptor to buffer, say, 1K of data before writing to disk.
The reason is that I'm writing a unit test for a "flush" function of a C++ class. To test that it's working I want to write some data, check the size of the file on disk, then flush, then check that the size has grown. But in practice, by the time I do the first file size check the data has already been written.
Note that I'm working with a raw file descriptor here, not a stream or anything.
This is on linux if that matters.
How to force a file descriptor to buffer my output
If you're using the POSIX write() (or a variant of it), you can't.
The write() call must behave thus:
After a write() to a regular file has successfully returned:
Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.
Any subsequent successful write() to the same byte position in the file shall overwrite that file data.
Those requirements mean the data written is visible to any other process on the system, and to be consistent, if the data written causes the file size to grow, the file size reported by the kernel must reflect the data written.
I want to write some data, check the size of the file on disk, then flush, then check that the size has grown.
That fundamentally doesn't work with write(). The file size grows as the data is written - write() does not buffer data in user space.
If you want it to do that, you'll have to implement your own filesystem - one that isn't POSIX compliant.
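As a rough illustration of why the test can't work this way, here is a minimal sketch (Linux, with a hypothetical temporary file name) showing that the size reported by fstat() already reflects the write() the moment write() returns:

#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("testfile.tmp", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    assert(fd != -1);

    const char data[] = "some test data";
    ssize_t n = write(fd, data, sizeof data - 1);   // no user-space buffering involved
    assert(n == static_cast<ssize_t>(sizeof data - 1));

    struct stat st;
    fstat(fd, &st);
    // The size has already grown; there is nothing left for a "flush" to add.
    assert(st.st_size == n);

    close(fd);
    unlink("testfile.tmp");
}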
Related
My Operating Systems professor was talking today about how a read system call is unbuffered while a istream::read function has a buffer. This left me a bit confused as you still make a buffer for the istream::read function when using it.
The only thing I can think of is that there is more than one buffer involved in the istream::read call. Why?
What does the istream::read() function do differently from the read() function system call?
The professor was talking about buffers internal to the istream rather than the buffer provided by the calling code where the data ends up after the read.
As an example, say you are reading individual int objects out of an istream. The istream is likely to have an internal buffer where some number of bytes is stored, and the next read can be satisfied out of that rather than going to the OS. Note, however, that whatever the istream is hooked to very likely has internal buffers as well.

Most OSes have means to perform zero-copy reads (that is, read directly from the I/O source into your buffer), but that facility comes with severe restrictions (the read size must be a multiple of some particular number of bytes, and if reading from a disk file the file pointer must also be on a multiple of that byte count). Most of the time such zero-copy reads are not worth the hassle.
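A small sketch of that layering, assuming a hypothetical binary file "numbers.bin": the 4-byte buffer passed to read() belongs to the caller, while the filebuf inside the ifstream keeps its own, typically much larger, buffer (how much it pre-reads is implementation-defined):

#include <fstream>
#include <iostream>

int main() {
    std::ifstream in("numbers.bin", std::ios::binary);

    char firstFour[4];
    in.read(firstFour, sizeof firstFour);   // the caller's buffer: 4 bytes end up here

    // The stream's own buffer (the filebuf) typically pulled a much larger chunk
    // from the OS in one go; in_avail() reports how many of those bytes are still
    // sitting in that internal buffer, ready to satisfy the next read() without
    // another system call.
    std::streamsize buffered = in.rdbuf()->in_avail();
    std::cout << "bytes already buffered inside the stream: " << buffered << '\n';
}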
I have read about buffers and streams and how they work with files in C++, but I don't understand why a buffer is needed if there is a stream; the stream is what transfers the data of a file to the program. So why do we use buffers to store data (which seems to perform the same task the stream does), and what are buffered and unbuffered streams?
Consider a stream that writes to a file. If there were no buffer, writing a single byte to the stream would mean writing a single byte to the file. That's very inefficient. So streams have buffers to decouple operations on one side of the stream from operations on the other side of the stream.
OK, let's start from scratch. Suppose you want to work with files. You would have to manage how data gets into your file, whether sending data to the file succeeded, and all the other basic details. You could manage all of that yourself, which would take a lot of time and hard work, or you can use a stream.

Yes, you can use a stream for such purposes. Streams work through an abstraction mechanism: as C++ programmers we don't know how they work internally, we only know that we stand at one end of the stream (our program's side), we hand our data to the stream, and it has the responsibility of transferring the data from one end to the other (the file's side).
E.g.

#include <fstream>

std::ofstream file("abc.txt"); // an object of the output file stream class is created
file << "Hello";               // we just give our data to the stream and it transfers it to the file
file.close();                  // the file is closed
Now, if you work with files, you should know that file access is a really expensive operation, i.e. it takes more time to access a file than to access memory, and we don't need to touch the file on every operation. So programmers created a feature called a buffer, which is a part of the computer's memory that temporarily stores data while handling files.

Suppose that instead of reading the file every time you need data, you just read a memory location into which the file's data has been temporarily copied. That is a much cheaper task, because you are reading memory, not the file.

Streams that use a buffer in their operation, i.e. that open the file and by default copy data from the file into the buffer, are called buffered streams, whereas streams that don't use any buffer are called unbuffered streams.

Now, if you write data to a buffered stream, that data is queued up until the stream is flushed (flushing means writing the buffer's contents out to the file). Unbuffered streams can feel faster from the point of view of the user at one end of the stream, because data is not temporarily stored in a buffer and is sent to the file as soon as it reaches the stream.
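A rough sketch of that difference, with made-up file names: requesting an unbuffered filebuf via pubsetbuf(nullptr, 0) before opening makes each insertion go (more or less) straight to the file, while the default, buffered stream only touches the file when its buffer fills or it is flushed:

#include <fstream>

int main() {
    std::ofstream buffered("buffered.txt");        // uses the default stream buffer

    std::ofstream unbuffered;
    unbuffered.rdbuf()->pubsetbuf(nullptr, 0);     // request an unbuffered filebuf
    unbuffered.open("unbuffered.txt");

    for (int i = 0; i < 1000; ++i) {
        buffered   << i << '\n';  // queued in the buffer, written to the file in large chunks
        unbuffered << i << '\n';  // goes (more or less) straight to the file
    }

    buffered.flush();             // explicitly push the queued data out to the file
}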
A buffer and a stream are different concepts.
A buffer is a part of the memory to temporarily store data. It can be implemented and structured in various ways. For instance, if one wants to read a very large file, chunks of the file can be read and stored in the buffer. Once a certain chunk is processed the data can be discarded and the next chunk can be read. A chunk in this case could be a line of the file.
Streams are the way C++ handles input and output. Their implementation uses buffers.
I do agree that streams are probably the most poorly written and most badly understood part of the standard library. People use them every day and many of them have not the slightest clue how the constructs they use work. For a little fun, try asking around what std::endl is - you might find that some of the answers are funny.

At any rate, streams and streambufs have different responsibilities. Streams are supposed to provide formatted input and output - that is, translate an integer to a sequence of bytes (or the other way around) - while buffers are responsible for conveying the sequence of bytes to the media.

Unfortunately, this design is not clear from the implementation. For instance, we have all those numerous streams - file streams and string streams, for example - while the only difference between them is the buffer; the stream code remains exactly the same. I believe many people would redesign streams if they had their way, but I am afraid that is not going to happen.
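A small sketch of that separation of responsibilities (the function and file names are made up): the same formatting code works unchanged whether the bytes end up in a file or in a string, because only the streambuf differs:

#include <fstream>
#include <iostream>
#include <sstream>

void writeReport(std::ostream& out) {        // formatting layer: knows nothing about the media
    out << "answer = " << 42 << '\n';
}

int main() {
    std::filebuf fileBuf;
    fileBuf.open("report.txt", std::ios::out);
    std::ostream fileStream(&fileBuf);       // ostream over a file buffer
    writeReport(fileStream);

    std::stringbuf stringBuf;
    std::ostream stringStream(&stringBuf);   // the same ostream code over a string buffer
    writeReport(stringStream);
    std::cout << stringBuf.str();
}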
I have a C++ application that handles multiple connections and writes to file. When using function _sopen_s with parameter _SH_DENYNO all threads that are working simultaneously are writing in the file and I see no loss of data. Can you tell me how the access to the file is managed by the function so there is no loss of data ?
If you're using write(), or some OS-supplied variant of it, each individual write() call tends to be implemented atomically, because of the implications of this POSIX statement regarding write():
On a regular file or other file capable of seeking, the actual writing of data shall proceed from the position in the file indicated by the file offset associated with fildes. Before successful return from write(), the file offset shall be incremented by the number of bytes actually written. On a regular file, if this incremented file offset is greater than the length of the file, the length of the file shall be set to this file offset.
That's somewhat hard to implement in a way that allows for interleaved data from separate write() calls on the same file descriptor, because if data is interleaved the change in the file offset upon completion wouldn't be equal to the "number of bytes actually written". So that statement could be interpreted as an implied requirement for the atomicity of write() calls on any "regular file or other file capable of seeking".
Also, there's the explicit POSIX requirement that write() calls to pipes of less than or equal to PIPE_BUF bytes be atomic:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of POSIX.1-2008 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.
Since write() just gets an int for a file descriptor with no other information directly available as to what the file descriptor refers to, the simplest way to implement write() in a way that meets the POSIX requirement for atomic write() to a pipe is to make each write() call atomic.
So, while there's no requirement for atomicity unless you're writing less than or equal to PIPE_BUF bytes to an actual pipe, write() tends to be implemented atomically for everything.
Now, that doesn't mean that whatever the file descriptor points to won't break the data up or interleave it with other data. For example, I wouldn't be surprised at all to see interleaved data if two threads were each to try calling one write() operation to stream several GB of data from MPEG files simultaneously to the same TCP socket.
And you're not actually calling write(). But the underlying implementation is likely shared.
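For illustration, here is a hedged sketch on the POSIX side (the file name is made up): each thread emits every record with a single write() call on a shared descriptor, relying on the typical - though, as discussed above, not universally guaranteed - atomicity of individual write() calls so that records do not interleave mid-record:

#include <fcntl.h>
#include <string>
#include <thread>
#include <unistd.h>
#include <vector>

void writer(int fd, int id) {
    for (int i = 0; i < 100; ++i) {
        std::string record = "thread " + std::to_string(id) +
                             " record " + std::to_string(i) + "\n";
        // One write() per record; splitting a record across two calls would
        // allow another thread's data to land in between.
        write(fd, record.data(), record.size());
    }
}

int main() {
    int fd = open("log.txt", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    std::vector<std::thread> threads;
    for (int id = 0; id < 4; ++id)
        threads.emplace_back(writer, fd, id);
    for (auto& t : threads)
        t.join();
    close(fd);
}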
I am writing a large binary output buffer through ofstream::write(). Since I know the size of the output file, but sometimes have to write it in chunks, I thought it would be a good idea to call fallocate() (or posix_fallocate()) first to preallocate the buffer on disk. Those do, however, require a file descriptor, which ofstream does not provide me with.
Is there an ofstream interface for calling fallocate(), or possibly to get the underlying file descriptor so that I can call it myself? (Or is it not worth the bother?)
Since you are going to write in chunks, use fwrite().

Also see http://www.cplusplus.com/reference/cstdio/setvbuf/ to control the buffer size.

To optimize, you can make the buffer size equal to N * chunk size.
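A rough sketch of that suggestion (the file name and sizes are made up): give the FILE a buffer that is N times the chunk size via setvbuf(), so several chunks are coalesced into each underlying write:

#include <cstdio>
#include <vector>

int main() {
    const std::size_t chunkSize = 64 * 1024;        // size of one output chunk
    const std::size_t bufferSize = 8 * chunkSize;   // buffer size = N * chunk size

    std::FILE* f = std::fopen("output.bin", "wb");
    if (!f)
        return 1;

    std::vector<char> ioBuffer(bufferSize);
    std::setvbuf(f, ioBuffer.data(), _IOFBF, ioBuffer.size()); // must come before other I/O

    std::vector<char> chunk(chunkSize, 0);          // one chunk of data to write
    for (int i = 0; i < 100; ++i)
        std::fwrite(chunk.data(), 1, chunk.size(), f);  // buffered; hits the disk in large pieces

    std::fclose(f);                                 // flushes whatever is still buffered
}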
Suppose I have a file which has x records. One 'block' holds m records, so the total number of blocks in the file is n = x/m. If I know the size of one record, say b bytes (so the size of one block is b*m), I can read a complete block at once using the read() system call (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector?

The reason I want to do this in the first place is to reduce disk I/O operations, as disk I/O is much more expensive according to what I have learned.

Or will it take the same amount of time as reading record by record from the file and putting each record directly into the vector instead of reading block by block? Reading block by block, I will have only n disk I/Os, whereas reading record by record would need x.
You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that the file contents are simply mapped into your process's address space, as if you already had a pointer to them. By inspecting that memory and treating it as an array, or by copying data out of it with memcpy(), you implicitly perform read operations, but only as necessary - the operating system's virtual memory subsystem is smart enough to do this very efficiently.

The only possible reason to avoid mmap may be if you are running on a 32-bit OS and the file size exceeds 2 gigabytes (or slightly less than that). In that case the OS may have trouble allocating address space for your mmap-ed memory. But on a 64-bit OS, using mmap should never be a problem.

Also, mmap can be cumbersome if you are writing a lot of data and the size of the data is not known upfront. Other than that, it is generally better and faster to use it than read().

Actually, most modern operating systems rely on mmap extensively. For example, in Linux, to execute a binary, the executable is simply mmap-ed and executed from memory as if it had been copied there by read(), without actually reading it up front.
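A minimal sketch of the mmap() approach (the record layout and file name are assumptions): the whole file is mapped once and then treated as an array of fixed-size records, and the OS pages data in only as the records are actually touched:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

struct Record {              // hypothetical fixed-size record of b bytes
    char payload[64];
};

int main() {
    int fd = open("records.dat", O_RDONLY);
    if (fd == -1)
        return 1;

    struct stat st;
    fstat(fd, &st);

    void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    const Record* records = static_cast<const Record*>(addr);
    std::size_t count = st.st_size / sizeof(Record);   // x records in total

    // Copying into a vector performs the reads implicitly, page by page,
    // only as the mapped memory is actually touched.
    std::vector<Record> v(records, records + count);

    munmap(addr, st.st_size);
    close(fd);
}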
Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same thing by simply adjusting the size of the buffer used by the stream (which is probably a lot easier).
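For example, something along these lines (the file name and sizes are arbitrary, and whether pubsetbuf() honours the request is implementation-defined, though common implementations do) gives the stream a larger buffer so each underlying read pulls in a bigger block:

#include <fstream>
#include <vector>

int main() {
    std::vector<char> bigBuffer(1 << 20);                        // 1 MiB stream buffer

    std::ifstream in;
    in.rdbuf()->pubsetbuf(bigBuffer.data(), bigBuffer.size());   // must be done before any I/O
    in.open("records.dat", std::ios::binary);

    std::vector<char> record(128);                               // read one record at a time
    while (in.read(record.data(), record.size())) {
        // process the record; most reads are satisfied from the big buffer,
        // not from a separate disk I/O
    }
}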