Is a seek/sync required to switch between reading & writing an fstream?

Is a seek/sync required to switch between reading & writing an fstream? - c++

When using a std::fstream you can read and write to the file. I'm wondering if there is any part of the C++ standard stating the requirements for switching between read & write operations.
What I found is:
The restrictions on reading and writing a sequence controlled by an
object of class basic_filebuf<charT, traits> are the same as for
reading and writing with the C standard library FILEs.
In particular:
— If the file is not open for reading the input sequence cannot be read.
— If the file is not open for writing the output sequence cannot be written.
— A joint file position is maintained for both the input sequence and the output sequence
So according to only the bullet points if I open a std::fstream, read 5 bytes, then write 2 bytes the new bytes will end up at position 6 & 7: First 2 conditions are met and due to the "joint file position" the read will advance the write position by 5 bytes.
However in practice intermediate buffers are used instead of directly calling fread/fwrite for every single byte. Hence the 5-byte-read could read 512 bytes from the underlying FILE and if that is not accounted for during the write that will end up at position 513 not 6 in the above example.
And indeed libstdc++ (GCC) does correct the offsets: https://github.com/gcc-mirror/gcc/blob/13ea4a6e830da1f245136601e636dec62e74d1a7/libstdc%2B%2B-v3/include/bits/fstream.tcc#L554
I haven't found anything similar in libc++ (Clang) or MS STL, although MS "cheats" by using the buffers from the FILE* inside the filebuf, so it might "work" there.
Anyway: For the C functions I found that switching between read & write requires a seek or flush (after write): https://en.cppreference.com/w/c/io/fopen
So does the part "The restrictions [...] of class basic_filebuf<charT, traits> are the same as [...] with the C standard library" may also be applied to this, so a seekg, seekp, tellg, tellp or sync is required to switch between operations?
Context: I maintain a library providing an (enhanced) filebuf class and not checking if the FILE* position needs to be adjusted to account for buffering due to potential read/write switches without seek/sync saves a potentially expensive conditional. But I don't want to deviate from the standard requirements to not break users expecting a drop-in replacement.

Related

What will happen if the given offset in fseek goes beyond the last character

I'm currently using c++ and trying to write a file using fseek() in order to write on the given offset calculated from other methods. Just wondering what will happen if the given offset will make the FILE pointer go beyond the last character in the file.
Example:
In a file with "abcdefg" as the contents, what will fseek(someFILEpointer, 20, SEEK_SET) return?

From cppreference:
POSIX allows seeking beyond the existing end of file. If an output is performed after this seek, any read from the gap will return zero bytes. Where supported by the filesystem, this creates a sparse file.
It sounds like it should return a non-error status, but subsequent reads may fail. Subsequent writes may succeed, but the exact behavior may depend on the underlying filesystem.

The C standard leaves it implementation-defined whether such a call to fseek succeeds or not. If the file position cannot be set in the manner indicated, fseek will return an error indication.
From the C standard:
A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END. (§7.21.9.2/3)
For a text stream, either offset shall be zero, or offset shall be a value returned by an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET.
So in neither case are you guaranteed to be able to call fseek with a non-zero offset and whence set to SEEK_END.
Posix does allow the call (quotes from the description of fseek):
The fseek() function shall allow the file-position indicator to be set beyond the end of existing data in the file. If data is later written at this point, subsequent reads of data in the gap shall return bytes with the value 0 until data is actually written into the gap.
(Posix leaves it up to the implementation whether the bytes with value 0 are actually stored, or are implicit. Most Unix file systems implement sparse files which can optimize this case by not storing the zeros on persistent storage, but this is not possible on a FAT filesystem, for example.)
Even Posix only makes this guarantee for regular files:
The behavior of fseek() on devices which are incapable of seeking is implementation-defined. The value of the file offset associated with such a device is undefined.
So the call may fail, but that is not undefined behaviour. If the repositioning is not possible, fseek will return a nonzero value; in the case of Posix implementations, the nonzero value will be -1 and errno will be set to a value which might help clarify the cause of the failure.

In linux (and unix in general), it would succed and return the new offset measured from the beginning of the file, but the file won't increase in size until you write something at that offset.
Your unwritten part will be read back as zeros from the file, but depending on OS and file system, some of the zeros might not have to occupy space on the harddrive.

What's the difference between read() and getc()

I have two code segments:
while((n=read(0,buf,BUFFSIZE))>0)
if(write(1,buf,n)!=n)
err_sys("write error");
while((c=getc(stdin))!=EOF)
if(putc(c,stdout)==EOF)
err_sys("write error");
Some sayings on internet make me confused. I know that standard I/O does buffering automatically, but I have passed a buf to read(), so read() is also doing buffering, right? And it seems that getc() read data char by char, how much data will the buffer have before sending all the data out?
Thanks

While both functions can be used to read from a file, they are very different. First of all on many systems read is a lower-level function, and may even be a system call directly into the OS. The read function also isn't standard C or C++, it's part of e.g. POSIX. It also can read arbitrarily sized blocks, not only one byte at a time. There's no buffering (except maybe at the OS/kernel level), and it doesn't differ between "binary" and "text" data. And on POSIX systems, where read is a system call, it can be used to read from all kind of devices and not only files.
The getc function is a higher level function. It usually uses buffered input (so input is read in blocks into a buffer, sometimes by using read, and the getc function gets its characters from that buffer). It also only returns a single characters at a time. It's also part of the C and C++ specifications as part of the standard library. Also, there may be conversions of the data read and the data returned by the function, depending on if the file was opened in text or binary mode.
Another difference is that read is also always a function, while getc might be a preprocessor macro.
Comparing read and getc doesn't really make much sense, more sense would be comparing read with fread.

Are there cases where fseek/ftell can give the wrong file size?

In C or C++, the following can be used to return a file size:
const unsigned long long at_beg = (unsigned long long) ftell(filePtr);
fseek(filePtr, 0, SEEK_END);
const unsigned long long at_end = (unsigned long long) ftell(filePtr);
const unsigned long long length_in_bytes = at_end - at_beg;
fprintf(stdout, "file size: %llu\n", length_in_bytes);
Are there development environments, compilers, or OSes which can return the wrong file size from this code, based on padding or other information that is situation-specific? Were there changes in the C or C++ specification around 1999, which would have lead to this code no longer working in certain cases?
For this question, please assume I am adding large file support by compiling with the flags -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE=1. Thanks.

It won't work on unseekable files like /proc/cpuinfo or /dev/stdin or /dev/tty, or pipe files gotten with popen
And it won't work if that file is written by another process at the same time.
Using the Posix stat function is probably more efficient and more reliable. Of course, this function might not be available on non Posix systems.

The fseek and ftell functions are both defined by the ISO C language standard.
The following is from latest public draft of the 2011 C standard, but the 1990, 1999, and 2011 ISO C standards are all very similar in this area, if not identical.
7.21.9.4:
The ftell function obtains the current value of the file position
indicator for the stream pointed to by stream. For a binary stream,
the value is the number of characters from the beginning of the file.
For a text stream, its file position indicator contains unspecified
information, usable by the fseek function for returning the file
position indicator for the stream to its position at the time of the
ftell call; the difference between two such return values is not
necessarily a meaningful measure of the number of characters written
or read.
7.21.9.2:
The fseek function sets the file position indicator for the stream
pointed to by stream. If a read or write error occurs, the error
indicator for the stream is set and fseek fails.
For a binary stream, the new position, measured in characters from the
beginning of the file, is obtained by adding offset to the
position specified by whence. The specified position is the
beginning of the file if whence is SEEK_SET, the current value
of the file position indicator if SEEK_CUR, or end-of-file if
SEEK_END. A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.
For a text stream, either offset shall be zero, or offset
shall be a value returned by an earlier successful call to the
ftell function on a stream associated with the same file and whence shall be SEEK_SET.
Violating any of the "shall" clauses makes your program's behavior undefined.
So if the file was opened in binary mode, ftell gives you the number of characters from the beginning of the file -- but an fseek relative to the end of the file (SEEK_END) is not necessarily meaningful. This accommodates systems that store binary files in whole blocks and don't keep track of how much was written to the final block.
If the file was opened in text mode, you can seek to the beginning or end of the file with an offset of 0, or you can seek to a position given by an earlier call to ftell; fseek with any other arguments has undefined behavior. This accomodates systems where the number of characters read from a text file doesn't necessarily correspond to the number of bytes in the file. For example, on Windows reading a CR-LF pair ("\r\n") reads only one character, but advances 2 bytes in the file.
In practice, on Unix-like systems text and binary modes behave the same way, and the fseek/ftell method will work. I suspect it will work on Windows (my guess is that ftell will give the byte offset, which may not be the same as the number of times you could call getchar() in text mode).
Note also that ftell() returns a result of type long. On systems where long is 32 bits, this method can't work for files that are 2 GiB or larger.
You might be better off using some system-specific method to get the size of a file. Since the fseek/ftell method is system-specific anyway, such as stat() on Unix-like systems.
On the other hand, fseek and ftell are likely to work as you expect on most systems you're likely to encounter. I'm sure there are systems where it won't work; sorry, but I don't have specifics.
If working on Linux and Windows is good enough, and you're not concerned with large files, then the fseek/ftell method is probably ok. Otherwise, you should consider using a system-specific method to determine the size of a file.
And keep in mind that anything that tells you the size of a file can only tell you its size at that moment. The file's size could change before you access it.

1) Superficially, your code looks "OK" - I don't see any problem with it.
2) No - there isn't any "C or C++ specification" that would affect fseek. There is a Posix specification:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/fseek.html
3) If you want "file size", my first choice would probably by "stat()". Here's the Posix specification:
http://pubs.opengroup.org/onlinepubs/007904975/functions/stat.html
4) If something's "going wrong" with your method, then my first guess would be "large file support".
For example, many OS's had parallel "fseek()" and "fseek64()" APIs.
'Hope that helps .. PSM

POSIX defines the return value from fseek as "measured in bytes from the beginning of the file". Your at_beg will always be zero (assuming this is a newly opened file).
So, assuming that:
the file is seekable
there are no concurrency issues to be concerned about
the file size is representable in the data type used by the fseek/ftell variant you choose
then your code should work on any POSIX-compliant system.

Unexpected "padding" in a Fortran unformatted file

I don't understand the format of unformatted files in Fortran.
For example:
open (3,file=filename,form="unformatted",access="sequential")
write(3) matrix(i,:)
outputs a column of a matrix into a file. I've discovered that it pads the file with 4 bytes on either end, however I don't really understand why, or how to control this behavior. Is there a way to remove the padding?

For unformated IO, Fortran compilers typically write the length of the record at the beginning and end of the record. Most but not all compilers use four bytes. This aids in reading records, e.g., length at the end assists with a backspace operation. You can suppress this with the new Stream IO mode of Fortran 2003, which was added for compatibility with other languages. Use access='stream' in your open statement.

I never used sequential access with unformatted output for this exact reason. However it depends on the application and sometimes it is convenient to have a record length indicator (especially for unstructured data). As suggested by steabert in Looking at binary output from fortran on gnuplot, you can avoid this by using keyword argument ACCESS = 'DIRECT', in which case you need to specify record length. This method is convenient for efficient storage of large multi-dimensional structured data (constant record length). Following example writes an unformatted file whose size equals the size of the array:
REAL(KIND=4),DIMENSION(10) :: a = 3.141
INTEGER :: reclen
INQUIRE(iolength=reclen)a
OPEN(UNIT=10,FILE='direct.out',FORM='UNFORMATTED',&
ACCESS='DIRECT',RECL=reclen)
WRITE(UNIT=10,REC=1)a
CLOSE(UNIT=10)
END
Note that this is not the ideal aproach in sense of portability. In an unformatted file written with direct access, there is no information about the size of each element. A readme text file that describes the data size does the job fine for me, and I prefer this method instead of padding in sequential mode.

Fortran IO is record based, not stream based. Every time you write something through write() you are not only writing the data, but also beginning and end markers for that record. Both record markers are the size of that record. This is the reason why writing a bunch of reals in a single write (one record: one begin marker, the bunch of reals, one end marker) has a different size with respect to writing each real in a separate write (multiple records, each of one begin marker, one real, and one end marker). This is extremely important if you are writing down large matrices, as you could balloon the occupation if improperly written.

Fortran Unformatted IO I am quite familiar with differing outputs using the Intel and Gnu compilers. Fortunately my vast experience dating back to 1970's IBM's allowed me to decode things. Gnu pads records with 4 byte integer counters giving the record length. Intel uses a 1 byte counter and a number of embedded coding values to signify a continuation record or the end of a count. One can still have very long record lengths even though only 1 byte is used.
I have software compiled by the Gnu compiler that I had to modify so it could read an unformatted file generated by either compiler, so it has to detect which format it finds. Reading an unformatted file generated by the Intel compiler (which follows the "old' IBM days) takes "forever" using Gnu's fgetc or opening the file in stream mode. Converting the file to what Gnu expects results in a factor of up to 100 times faster. It depends on your file size if you want to bother with detection and conversion or not. I reduced my program startup time (which opens a large unformatted file) from 5 minutes down to 10 seconds. I had to add in options to reconvert back again if the user wants to take the file back to an Intel compiled program. It's all a pain, but there you go.

Will fseek function flush data in the buffer in C++?

We know that call to functions like fprintf or fwrite will not write data to the disk immediately, instead, the data will be buffered until a threshold is reached. My question is, if I call the fseek function, will these buffered data writen to disk before seeking to the new position? Or the data is still in the buffer, and is writen to the new position?
cheng

I'm not aware if the buffer is guaranteed to be flushed, it may not if you seek to a position close enough. However there is no way that the buffered data will be written to the new position. The buffering is just an optimization, and as such it has to be transparent.

Yes; fseek() ensures that the file will look like it should according to the fwrite() operations you've performed.
The C standard, ISO/IEC 9899:1999 §7.19.9.2 fseek(), says:
The fseek function sets the file position indicator for the stream pointed to by stream.
If a read or write error occurs, the error indicator for the stream is set and fseek fails.

I don't believe that it's specified that the data must be flushed on a fseek but when the data is actually written to disk it must be written at that position that the stream was at when the write function was called. Even if the data is still buffered, that buffer can't be written to a different part of the file when it is flushed even if there has been a subsequent seek.

It seems that your real concern is whether previously-written (but not yet flushed) data would end up in the wrong place in the file if you do an fseek.
No, that won't happen. It'll behave as you'd expect.

I have vague memories of a requirement that you call fflush before
fseek, but I don't have my copy of the C standard available to verify.
(If you don't it would be undefined behavior or implementation defined,
or something like that.) The common Unix standard specifies that:
If the most recent operation, other than ftell(), on a given stream is
fflush(), the file offset in the underlying open file description
shall be adjusted to reflect the location specified by fseek().
[...]
If the stream is writable and buffered data had not been written to
the underlying file, fseek() shall cause the unwritten data to be
written to the file and shall mark the st_ctime and st_mtime fields of
the file for update.
This is marked as an extention to the ISO C standard, however, so you can't count on it except on Unix platforms (or other platforms which make similar guarantees).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Is a seek/sync required to switch between reading & writing an fstream? - c++

Related

What will happen if the given offset in fseek goes beyond the last character

What's the difference between read() and getc()

Are there cases where fseek/ftell can give the wrong file size?

Unexpected "padding" in a Fortran unformatted file

Will fseek function flush data in the buffer in C++?

Categories

Resources