Will the fseek function flush data in the buffer in C++? - c++

We know that calls to functions like fprintf or fwrite will not write data to the disk immediately; instead, the data is buffered until a threshold is reached. My question is: if I call the fseek function, will this buffered data be written to disk before seeking to the new position? Or does the data stay in the buffer and get written at the new position?

I'm not aware of any guarantee that the buffer will be flushed; it may not be if you seek to a position close enough to stay within the buffered range. However, there is no way that the buffered data will be written at the new position. Buffering is just an optimization, and as such it has to be transparent.

Yes; fseek() ensures that the file will look like it should according to the fwrite() operations you've performed.
The C standard, ISO/IEC 9899:1999 §7.19.9.2 fseek(), says:
The fseek function sets the file position indicator for the stream pointed to by stream.
If a read or write error occurs, the error indicator for the stream is set and fseek fails.

I don't believe it's specified that the data must be flushed on an fseek, but when the data is actually written to disk it must be written at the position the stream was at when the write function was called. Even if the data is still buffered, that buffer can't be written to a different part of the file when it is flushed, even if there has been a subsequent seek.

It seems that your real concern is whether previously-written (but not yet flushed) data would end up in the wrong place in the file if you do an fseek.
No, that won't happen. It'll behave as you'd expect.
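For illustration, here is a minimal sketch (the file name demo.bin is just a placeholder) showing that the bytes land at the positions that were current when each write was issued, no matter when the buffer is physically flushed:
#include <stdio.h>

int main(void) {
    FILE *f = fopen("demo.bin", "w+b");
    if (!f) return 1;

    fputs("AAAA", f);            /* likely still sitting in the stdio buffer        */
    fseek(f, 0, SEEK_SET);       /* reposition; the buffered data must not "move"   */
    fputs("BB", f);              /* overwrites the first two bytes                  */

    fseek(f, 0, SEEK_SET);
    char buf[8] = {0};
    if (fread(buf, 1, 4, f) == 4)
        printf("%s\n", buf);     /* prints "BBAA" on a conforming implementation    */
    fclose(f);
    return 0;
}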

I have vague memories of a requirement that you call fflush before fseek, but I don't have my copy of the C standard available to verify. (If you don't, it would be undefined behavior or implementation-defined, or something like that.) The common Unix standard specifies that:
If the most recent operation, other than ftell(), on a given stream is
fflush(), the file offset in the underlying open file description
shall be adjusted to reflect the location specified by fseek().
[...]
If the stream is writable and buffered data had not been written to
the underlying file, fseek() shall cause the unwritten data to be
written to the file and shall mark the st_ctime and st_mtime fields of
the file for update.
This is marked as an extension to the ISO C standard, however, so you can't count on it except on Unix platforms (or other platforms that make similar guarantees).
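If you don't want to rely on that extension, a defensive and fully portable pattern is to flush explicitly before repositioning. This is only a sketch; f stands for any update-mode FILE * you already have open, and rewrite_header is a made-up name:
#include <stdio.h>

void rewrite_header(FILE *f, const char *header, size_t len) {
    fflush(f);                   /* push buffered output to the file            */
    fseek(f, 0, SEEK_SET);       /* now reposition to the start                 */
    fwrite(header, 1, len, f);   /* overwrite the header region                 */
    fflush(f);                   /* flush again before any later reads or seeks */
}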

Related

Do we need a mutex to perform multithreaded file I/O?

I'm trying to do random writes (a benchmark test) to a file using multiple threads (pthread). It looks like if I comment out the mutex lock, the created file's size is less than it should be, as if some writes are getting lost (always some multiple of the chunk size). But if I keep the mutex, it's always the exact size.
Does my code have a problem somewhere else, making the mutex not really required (as suggested by #evan), or is the mutex necessary here?
void *DiskWorker(void *threadarg) {
    FILE *theFile = fopen(fileToWrite, "a+");
    ....
    for (long i = 0; i < noOfWrites; ++i) {
        //pthread_mutex_lock (&mutexsum);
        // For random access
        fseek(theFile, randomArray[i] * chunkSize, SEEK_SET);
        fputs(data, theFile);
        // Or for sequential access (in that case the 2 lines above would not be here)
        fprintf(theFile, "%s", data);
        // sequential access end
        fflush(theFile);
        //pthread_mutex_unlock(&mutexsum);
    }
    .....
}
You are opening a file using "append mode". According to C11:
Opening a file with append mode ('a' as the first character in the
mode argument) causes all subsequent writes to the file to be forced
to the then current end-of-file, regardless of intervening calls to
the fseek function.
The C standard does not specify exactly how this should be implemented, but on POSIX systems it is usually implemented using the O_APPEND flag of the open function, while flushing the data is done with the write function. Note that the fseek call in your code should therefore have no effect on where the data ends up.
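A small sketch (the file name log.txt is a placeholder) of the append-mode rule quoted above: the fseek may move the read position, but the write still goes to the end of the file:
#include <stdio.h>

int main(void) {
    FILE *f = fopen("log.txt", "a+");
    if (!f) return 1;
    fputs("first\n", f);
    fseek(f, 0, SEEK_SET);       /* try to move to the beginning...          */
    fputs("second\n", f);        /* ...but this is still appended at the end */
    fclose(f);
    return 0;                    /* log.txt now ends with "first\nsecond\n"  */
}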
I think POSIX requires this, as it describes how redirecting output in append mode (>>) is done by the shell:
Appended output redirection shall cause the file whose name results
from the expansion of word to be opened for output on the designated
file descriptor. The file is opened as if the open() function as
defined in the System Interfaces volume of POSIX.1-2008 was called
with the O_APPEND flag. If the file does not exist, it shall be
created.
And since most programs use the FILE interface to send data to stdout, this probably requires fopen to use open with O_APPEND and write (and not functions like pwrite) when writing data.
So if, on your system, fopen with the 'a' mode uses O_APPEND, flushing is done using write, and your kernel and filesystem correctly implement the O_APPEND flag, using a mutex should have no effect, as the writes do not interleave:
If the O_APPEND flag of the file status flags is set, the file
offset shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing
the file offset and the write operation.
Note that not all filesystems support this behavior. Check this answer.
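For reference, here is a sketch of what the stdio layer is expected to do under the hood on POSIX when you fopen(..., "a"): open with O_APPEND and append with write(). The function name, file path, and chunk are placeholders, not code from the question:
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int append_chunk(const char *path, const char *chunk) {
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return -1;
    /* With O_APPEND, the offset is moved to end-of-file and the write is
       performed without another process's write landing in between (on
       filesystems that implement the flag correctly), so concurrent
       appenders don't need a mutex around this single call. */
    ssize_t n = write(fd, chunk, strlen(chunk));
    close(fd);
    return n < 0 ? -1 : 0;
}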
As for my answer to your previous question, my suggestion was to remove the mutex, as it should have no effect on the size of the file (and it didn't have any effect on my machine).
Personally, I have never really used O_APPEND and would be hesitant to do so, as its behavior might not be supported at some level; on top of that, its behavior is odd on Linux (see the BUGS section of pwrite).
You definitely need a mutex because you are issuing several different file commands. The underlying file subsystem can't possibly know how many file commands you are going to call to complete your whole operation.
So you need the mutex.
In your situation you may find you get better performance putting the mutex outside the loop. The reason being that, otherwise, switching between threads may cause excessive skipping between different parts of the disk. Hard disks take about 10ms to move the read/write head so that could potentially slow things down a lot.
So it might be a good idea to benchmark that.
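A sketch of that suggestion: take the lock once around the whole batch of writes instead of once per write, so a thread's writes stay contiguous. The identifiers mirror the question's snippet; the function signature is made up for illustration:
#include <pthread.h>
#include <stdio.h>

void write_batch(FILE *theFile, pthread_mutex_t *mutexsum,
                 const long *randomArray, long noOfWrites,
                 long chunkSize, const char *data) {
    pthread_mutex_lock(mutexsum);        /* one lock for the whole batch */
    for (long i = 0; i < noOfWrites; ++i) {
        fseek(theFile, randomArray[i] * chunkSize, SEEK_SET);
        fputs(data, theFile);
    }
    fflush(theFile);                     /* one flush at the end */
    pthread_mutex_unlock(mutexsum);
}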

What will happen if the given offset in fseek goes beyond the last character

I'm currently using C++ and trying to write to a file, using fseek() to position at an offset calculated by other methods. I'm just wondering what will happen if the given offset moves the FILE pointer beyond the last character in the file.
Example:
In a file with "abcdefg" as the contents, what will fseek(someFILEpointer, 20, SEEK_SET) return?
From cppreference:
POSIX allows seeking beyond the existing end of file. If an output is performed after this seek, any read from the gap will return zero bytes. Where supported by the filesystem, this creates a sparse file.
It sounds like it should return a non-error status; a subsequent read will simply hit end-of-file until something has been written there, and subsequent writes should succeed, though the exact behavior (for example, whether a sparse file is created) depends on the underlying filesystem.
The C standard leaves it implementation-defined whether such a call to fseek succeeds or not. If the file position cannot be set in the manner indicated, fseek will return an error indication.
From the C standard:
A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END. (§7.21.9.2/3)
For a text stream, either offset shall be zero, or offset shall be a value returned by an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET.
So in neither case are you guaranteed to be able to call fseek with a non-zero offset and whence set to SEEK_END.
Posix does allow the call (quotes from the description of fseek):
The fseek() function shall allow the file-position indicator to be set beyond the end of existing data in the file. If data is later written at this point, subsequent reads of data in the gap shall return bytes with the value 0 until data is actually written into the gap.
(Posix leaves it up to the implementation whether the bytes with value 0 are actually stored, or are implicit. Most Unix file systems implement sparse files which can optimize this case by not storing the zeros on persistent storage, but this is not possible on a FAT filesystem, for example.)
Even Posix only makes this guarantee for regular files:
The behavior of fseek() on devices which are incapable of seeking is implementation-defined. The value of the file offset associated with such a device is undefined.
So the call may fail, but that is not undefined behaviour. If the repositioning is not possible, fseek will return a nonzero value; in the case of Posix implementations, the nonzero value will be -1 and errno will be set to a value which might help clarify the cause of the failure.
In Linux (and Unix in general), the seek will succeed (fseek returns zero; the underlying lseek reports the new offset measured from the beginning of the file), but the file won't increase in size until you write something at that offset.
The unwritten gap will be read back as zeros from the file, but depending on the OS and filesystem, some of those zeros might not occupy any space on the hard drive.
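A sketch (gap.bin is a placeholder name) of seeking past the current end of file on a POSIX system: the seek succeeds, the file only grows once a byte is actually written, and the gap reads back as zero bytes:
#include <stdio.h>

int main(void) {
    FILE *f = fopen("gap.bin", "w+b");
    if (!f) return 1;
    fputs("abcdefg", f);                  /* 7 bytes of real data             */
    if (fseek(f, 20, SEEK_SET) != 0)      /* beyond EOF; returns 0 on success */
        perror("fseek");
    fputc('X', f);                        /* now the file is 21 bytes long    */
    fseek(f, 10, SEEK_SET);
    printf("byte 10 = %d\n", fgetc(f));   /* prints 0: a byte from the gap    */
    fclose(f);
    return 0;
}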

What's the difference between read() and getc()

I have two code segments:
while ((n = read(0, buf, BUFFSIZE)) > 0)
    if (write(1, buf, n) != n)
        err_sys("write error");

while ((c = getc(stdin)) != EOF)
    if (putc(c, stdout) == EOF)
        err_sys("write error");
Some things I've read on the internet have confused me. I know that standard I/O does buffering automatically, but I passed a buf to read(), so read() is also doing buffering, right? And it seems that getc() reads data character by character; how much data will the buffer hold before all the data is sent out?
Thanks
While both functions can be used to read from a file, they are very different. First of all, on many systems read is a lower-level function, and may even be a system call directly into the OS. The read function also isn't standard C or C++; it's part of e.g. POSIX. It can read arbitrarily sized blocks, not only one byte at a time. There's no buffering (except maybe at the OS/kernel level), and it doesn't distinguish between "binary" and "text" data. And on POSIX systems, where read is a system call, it can be used to read from all kinds of devices, not only files.
The getc function is a higher-level function. It usually uses buffered input (input is read in blocks into a buffer, sometimes by using read, and getc takes its characters from that buffer). It also returns only a single character at a time. It's part of the C and C++ specifications as part of the standard library. There may also be conversions between the data read and the data returned by the function, depending on whether the file was opened in text or binary mode.
Another difference is that read is always a function, while getc might be a preprocessor macro.
Comparing read and getc doesn't really make much sense; it would make more sense to compare read with fread.
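To compare like with like, here is a sketch of the same copy loop written with the buffered fread/fwrite pair instead of the raw read/write system calls (BUFFSIZE is just a placeholder value here):
#include <stdio.h>

#define BUFFSIZE 4096   /* placeholder; stands in for the question's BUFFSIZE */

void copy_stdio(FILE *in, FILE *out) {
    char buf[BUFFSIZE];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)   /* buffered read  */
        if (fwrite(buf, 1, n, out) != n)              /* buffered write */
            perror("write error");
}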

When does the output buffer flush?

Apart from manually calling flush, under what conditions will cout or stdout (printf) flush?
When exiting the current scope or the current function? Is it timed? Does it flush when the buffer is full (and how big is the buffer)?
For <stdio.h> streams you can set the buffering mode using setvbuf(). It accepts one of three buffering modes:
_IOFBF: the buffer is flushed when it is full or when a flush is explicitly requested.
_IOLBF: the buffer is flushed when a newline is found, the buffer is full, or a flush is requested.
_IONBF: the stream is unbuffered, i.e., output is sent as soon as available.
I had the impression that the default setup for stdout is _IOLBF, for stderr it is _IONBF, and for other streams it is _IOFBF. However, looking at the C standard I don't find any indication of what the default is for any C stream.
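A sketch of choosing a buffering mode explicitly with setvbuf. It must be called after the stream is opened and before any other operation on it; the buffer size is just an illustrative value:
#include <stdio.h>

int main(void) {
    static char buf[1 << 16];                 /* must outlive the stream */
    setvbuf(stdout, buf, _IOFBF, sizeof buf); /* fully buffered          */
    /* setvbuf(stdout, NULL, _IOLBF, 0);         or: line buffered       */
    /* setvbuf(stdout, NULL, _IONBF, 0);         or: unbuffered          */
    puts("this may sit in the buffer until it fills or the program exits");
    return 0;
}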
For the standard C++ stream objects there is no equivalent to _IOLBF: if you want line buffering you'd use std::endl or, preferably, '\n' and std::flush. There are a few setups for std::ostream, though:
You generally can use buf.pubsetbuf(0, 0) to turn a stream to be unbuffered. Since stream buffers can be implemented by users, it isn't guaranteed that the corresponding call to set the buffer is honored, though.
You can set std::ios_base::unitbuf, which causes the stream to be flushed after each [properly implemented] output operation. By default std::ios_base::unitbuf is only set for std::cerr.
The normal setup for a std::ostream is to flush the buffer when the buffer is full or when a flush is explicitly requested. Unfortunately, std::endl makes an explicit request to flush the buffer (causing performance problems in many cases because it tends to be used as a surrogate for '\n', which it is not).
An interesting one is the ability to tie() an output stream to an input stream: if in.tie() contains a pointer to a std::ostream, that output stream will be flushed prior to an attempt to read from in (assuming correctly implemented input operators, of course). By default, std::cin is tie()d to std::cout.
Nearly forgot an important one: unless std::ios_base::sync_with_stdio(false) has been called, the standard C++ streams (std::cin, std::cout, std::cerr, and std::clog, plus their wchar_t counterparts) are probably entirely unbuffered! With the default setting of std::ios_base::sync_with_stdio(true), the standard C and C++ streams can be used in a mixed way. However, since the C library is generally oblivious to the C++ library, this means that the C++ standard stream objects can't do any buffering of their own. The default sync_with_stdio(true) is the major performance problem with the standard C++ stream objects!
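A small sketch of the C++-side knobs mentioned above: sync_with_stdio(), tie(), explicit flushes, and unitbuf. Which combination is appropriate depends on whether you mix C and C++ I/O and how much buffering you want:
#include <iostream>

int main() {
    std::ios_base::sync_with_stdio(false); // let the C++ streams buffer on their own
    std::cin.tie(nullptr);                 // don't flush std::cout before reads
    std::cout << "buffered\n";             // '\n' alone does not flush
    std::cout << "flushed" << std::flush;  // explicit flush, no newline
    std::clog << std::unitbuf;             // flush std::clog after every output
    std::clog << "immediately visible\n";
    return 0;
}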
Neither in C nor in C++ can you really control the size of the buffers: requests to set a non-zero buffer are allowed to be ignored and normally will be ignored. That is, the stream will pretty much be flushed at somewhat arbitrary points.

Are there cases where fseek/ftell can give the wrong file size?

In C or C++, the following can be used to return a file size:
const unsigned long long at_beg = (unsigned long long) ftell(filePtr);
fseek(filePtr, 0, SEEK_END);
const unsigned long long at_end = (unsigned long long) ftell(filePtr);
const unsigned long long length_in_bytes = at_end - at_beg;
fprintf(stdout, "file size: %llu\n", length_in_bytes);
Are there development environments, compilers, or OSes which can return the wrong file size from this code, based on padding or other situation-specific information? Were there changes in the C or C++ specification around 1999 that would have led to this code no longer working in certain cases?
For this question, please assume I am adding large file support by compiling with the flags -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE=1. Thanks.
It won't work on unseekable files like /proc/cpuinfo, /dev/stdin, or /dev/tty, or on pipes obtained with popen.
And it won't work if that file is written by another process at the same time.
Using the POSIX stat function is probably more efficient and more reliable. Of course, that function might not be available on non-POSIX systems.
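A sketch of the stat() approach (POSIX only); with -D_FILE_OFFSET_BITS=64 the st_size field is a 64-bit off_t even on 32-bit Linux. The function name is made up for illustration:
#include <stdio.h>
#include <sys/stat.h>

int print_size(const char *path) {
    struct stat st;
    if (stat(path, &st) != 0) {
        perror("stat");
        return -1;
    }
    printf("file size: %lld\n", (long long) st.st_size);
    return 0;
}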
The fseek and ftell functions are both defined by the ISO C language standard.
The following is from the latest public draft of the 2011 C standard, but the 1990, 1999, and 2011 ISO C standards are all very similar in this area, if not identical.
7.21.9.4:
The ftell function obtains the current value of the file position
indicator for the stream pointed to by stream. For a binary stream,
the value is the number of characters from the beginning of the file.
For a text stream, its file position indicator contains unspecified
information, usable by the fseek function for returning the file
position indicator for the stream to its position at the time of the
ftell call; the difference between two such return values is not
necessarily a meaningful measure of the number of characters written
or read.
7.21.9.2:
The fseek function sets the file position indicator for the stream
pointed to by stream. If a read or write error occurs, the error
indicator for the stream is set and fseek fails.
For a binary stream, the new position, measured in characters from the
beginning of the file, is obtained by adding offset to the
position specified by whence. The specified position is the
beginning of the file if whence is SEEK_SET, the current value
of the file position indicator if SEEK_CUR, or end-of-file if
SEEK_END. A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.
For a text stream, either offset shall be zero, or offset
shall be a value returned by an earlier successful call to the
ftell function on a stream associated with the same file and whence shall be SEEK_SET.
Violating any of the "shall" clauses makes your program's behavior undefined.
So if the file was opened in binary mode, ftell gives you the number of characters from the beginning of the file -- but an fseek relative to the end of the file (SEEK_END) is not necessarily meaningful. This accommodates systems that store binary files in whole blocks and don't keep track of how much was written to the final block.
If the file was opened in text mode, you can seek to the beginning or end of the file with an offset of 0, or you can seek to a position given by an earlier call to ftell; fseek with any other arguments has undefined behavior. This accommodates systems where the number of characters read from a text file doesn't necessarily correspond to the number of bytes in the file. For example, on Windows reading a CR-LF pair ("\r\n") reads only one character, but advances 2 bytes in the file.
In practice, on Unix-like systems text and binary modes behave the same way, and the fseek/ftell method will work. I suspect it will work on Windows (my guess is that ftell will give the byte offset, which may not be the same as the number of times you could call getchar() in text mode).
Note also that ftell() returns a result of type long. On systems where long is 32 bits, this method can't work for files that are 2 GiB or larger.
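On POSIX systems you can sidestep the 32-bit long problem with fseeko/ftello, which use off_t (64 bits here, given the -D_FILE_OFFSET_BITS=64 flag from the question). A sketch, assuming the stream is seekable; the function name is made up:
#include <stdio.h>
#include <sys/types.h>

long long size_via_ftello(FILE *f) {
    off_t cur = ftello(f);              /* remember the current position */
    if (cur < 0 || fseeko(f, 0, SEEK_END) != 0)
        return -1;
    off_t end = ftello(f);              /* offset of end-of-file         */
    fseeko(f, cur, SEEK_SET);           /* restore the original position */
    return end < 0 ? -1 : (long long) end;
}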
Since the fseek/ftell method isn't fully portable anyway, you might be better off using a system-specific method to get the size of a file, such as stat() on Unix-like systems.
On the other hand, fseek and ftell are likely to work as you expect on most systems you're likely to encounter. I'm sure there are systems where it won't work; sorry, but I don't have specifics.
If working on Linux and Windows is good enough, and you're not concerned with large files, then the fseek/ftell method is probably ok. Otherwise, you should consider using a system-specific method to determine the size of a file.
And keep in mind that anything that tells you the size of a file can only tell you its size at that moment. The file's size could change before you access it.
1) Superficially, your code looks "OK" - I don't see any problem with it.
2) No - there wasn't any change in the "C or C++ specification" that would affect fseek. There is a Posix specification:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/fseek.html
3) If you want the "file size", my first choice would probably be "stat()". Here's the Posix specification:
http://pubs.opengroup.org/onlinepubs/007904975/functions/stat.html
4) If something's "going wrong" with your method, then my first guess would be "large file support".
For example, many OS's had parallel "fseek()" and "fseek64()" APIs.
POSIX defines the value returned by ftell as the offset "measured in bytes from the beginning of the file". Your at_beg will always be zero (assuming this is a newly opened file).
So, assuming that:
the file is seekable
there are no concurrency issues to be concerned about
the file size is representable in the data type used by the fseek/ftell variant you choose
then your code should work on any POSIX-compliant system.