Windows has the FlushFileBuffers() API to flush buffers to the hard drive for a single file. Linux has the sync() API to flush the file buffers for all files.
However, is there a WinAPI function for flushing all files too, i.e. a sync() analog?
https://learn.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flushfilebuffers
It is possible to flush an entire volume:
To flush all open files on a volume, call FlushFileBuffers with a handle to the volume. The caller must have administrative privileges. For more information, see Running with Special Privileges.
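For example, a minimal sketch (assuming the C: volume; the process must run with administrative privileges, and error handling is kept brief):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* FlushFileBuffers on a volume handle flushes all open files
       on that volume; opening the volume requires admin rights. */
    HANDLE hVolume = CreateFileW(L"\\\\.\\C:",
                                 GENERIC_READ | GENERIC_WRITE,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE,
                                 NULL, OPEN_EXISTING, 0, NULL);
    if (hVolume == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }
    if (!FlushFileBuffers(hVolume))
        fprintf(stderr, "FlushFileBuffers failed: %lu\n", GetLastError());
    CloseHandle(hVolume);
    return 0;
}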
The same article also states the correct procedure to follow if, for some reason, data must be flushed on every write: call the CreateFile function with the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags.
Due to disk caching interactions within the system, the FlushFileBuffers function can be inefficient when used after every write to a disk drive device when many writes are being performed separately. If an application is performing multiple writes to disk and also needs to ensure critical data is written to persistent media, the application should use unbuffered I/O instead of frequently calling FlushFileBuffers. To open a file for unbuffered I/O, call the CreateFile function with the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags. This prevents the file contents from being cached and flushes the metadata to disk with each write. For more information, see CreateFile.
But also check the file buffering restrictions regarding memory and data alignment.
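A minimal sketch of such an unbuffered, write-through open (the file name and the 4096-byte sector size are assumptions; real code should query the sector size, e.g. with GetDiskFreeSpace):

#include <windows.h>
#include <string.h>
#include <malloc.h>

int main(void)
{
    HANDLE h = CreateFileW(L"test.dat", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    /* With FILE_FLAG_NO_BUFFERING the buffer address and the byte count
       must both be multiples of the volume sector size, hence the
       aligned allocation and the padded write size. */
    const DWORD sector = 4096; /* assumed sector size */
    char *buf = (char *)_aligned_malloc(sector, sector);
    memset(buf, 0, sector);
    memcpy(buf, "Hello", 5);

    DWORD written;
    BOOL ok = WriteFile(h, buf, sector, &written, NULL);

    _aligned_free(buf);
    CloseHandle(h);
    return ok ? 0 : 1;
}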
According to the File Management Functions list, there is no analog of Linux's sync() in WinAPI.
Related
The question below may sound a bit long and complex, but actually it's quite a simple, generic, and common problem of three processes working on the same file. In the text below I try to decompose the problem into a set of particular requirements, with some illustrative examples.
Task preamble
There is a text file, called index, which contains some metadata.
There is an application (APP), which understands the file format and performs meaningful changes on it.
The file is stored under a version control system (VCS), which is a source of changes performed on the same file by other users.
We need to design the application (APP) so that it works with the file in a reasonable fashion, preferably without interfering much with the VCS, as it's assumed that the VCS keeps a large project of which the index file is just a small part, and the user may want to update from the VCS at any point without considering any ongoing operations within the APP. In that case the APP should gracefully handle the situation in a way that prevents any possible loss of data.
Preamble remarks
Please note that the VCS is unspecified: it could be Perforce, Git, SVN, tarballs, flash drives, or your favourite WWII Morse-based radio and a text editor.
The file could be binary; that doesn't change things much. But with VCS storage in mind, it's prone to being merged, and therefore a text/human-readable format is most adequate.
Possible examples of such files are: complex configurations (AI behaviour trees, game object descriptions), resource listings, and other things that are not meant to be edited by hand, are related to the project at hand, but whose history matters.
Note that, unless you are keen to implement your own version control system, "outsourcing" most of the configuration into some external, client-server based solution does not solve the problem - you still have to keep a reference file within the version control system, pointing at the matching version of the configuration in the database. Which means that you still have the same problem, but at a slightly smaller scale - a single text line in a file instead of a dozen.
The task itself
A generic APP in a vacuum may process the index in three phases: read, modify, write. The read phase reads and de-serializes the file, the modify phase changes an in-memory state, and the write phase serializes the state and writes it to the file.
There are three kinds of generic workflows for such an application:
read -> <present an information>
read -> <present an information and await user's input> -> modify -> write
read -> modify -> write
The first workflow is for read-only "users", like a game client, which reads the data once and forgets about the file.
The second workflow is for an editing application. With external updates being a rather rare occurrence, and it being improbable that a user will edit the same file in several editing applications at the same time, it's reasonable to assume that a generic editing application will want to read the state only once (especially if it's a resource-consuming operation) and re-read it only in case of external updates.
The third workflow is for automated CLI usage - build servers, scripts, and such.
With that in mind, it's reasonable to treat read and modify + write separately. Let's call an operation that performs only the read phase and prepares some information a read operation. A write operation, then, is an operation that modifies the state from a read operation and writes it to disk.
As workflows one and two may be run at the same time by different application instances, it's also reasonable to allow multiple read operations to run concurrently. Some read operations, like reads for editing applications, may want to wait until any existing write operations are finished, to read the most recent and up-to-date state. Other read operations, like those in a game client, may want to read the current state, whatever it is, without being blocked at all.
On the other hand, it's only reasonable for a write operation to detect any other running write operations and abort. Write operations should also detect any external changes made to the index file and abort. Rationale: there is no point in performing (and waiting for) any work that would be thrown away because it was based on a possibly out-of-date state.
For a robust application, the possibility of a critical failure of galactic scale should be assumed at every single point of the application. Under no circumstances should such a failure leave the index file inconsistent.
Requirements
file reads are consistent - under no circumstances should we read one half of the file from before a change and the other half from after it.
write operations are exclusive - no other write operation is allowed at the same time on the same file.
write operations are robustly waitable - we should be able to wait for a write operation to complete or fail.
write operations are transactional - under no circumstances should the file be left in a partially changed or otherwise inconsistent state, or be based on an out-of-date state. Any change to the index file prior to or during the operation should be detected, and the operation should be aborted as soon as possible.
Linux
A read operation (see the sketch after this list):
Obtain a shared lock, if requested - open(2) (O_CREAT | O_RDONLY) and flock(2) (LOCK_SH) the "lock" file.
open(2) (O_RDONLY) the index file.
Create contents snapshot and parse it.
close(2) the index file.
Unlock - flock(2) (LOCK_UN) and close(2) the "lock" file.
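A minimal sketch of this read operation, assuming "index" and "index.lock" as file names and keeping error handling brief:

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int read_index(int shared_lock)
{
    int lock_fd = -1;
    if (shared_lock) {
        lock_fd = open("index.lock", O_CREAT | O_RDONLY, 0644);
        if (lock_fd < 0 || flock(lock_fd, LOCK_SH) < 0)
            return -1;
    }
    int fd = open("index", O_RDONLY);
    if (fd >= 0) {
        /* ... snapshot the contents and parse them ... */
        close(fd);
    }
    if (lock_fd >= 0) {
        flock(lock_fd, LOCK_UN);
        close(lock_fd);
    }
    return fd >= 0 ? 0 : -1;
}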
A write operation (the lease handling is sketched after this list):
Obtain an exclusive lock - open(2) (O_CREAT | O_RDONLY) and flock(2) (LOCK_EX) the "lock" file.
open(2) (O_RDONLY) the index file.
fcntl(2) (F_SETLEASE, F_RDLCK) the index file - we are only interested in writes, hence the read (F_RDLCK) lease.
Check if the state is up-to-date, do things, change the state, write it to a temporary file nearby.
rename(2) the temporary file to the index - it's atomic, and if we haven't got a lease break so far, we won't at all - this will be a different file, not the one we've got the lease on.
fcntl(2) (F_SETLEASE, F_UNLCK) the index file.
close(2) the index file (the "old" one, with no reference left in the filesystem).
Unlock - close(2) the "lock" file.
If the lease-break signal is received - abort and clean up, no rename. rename(2) is not documented as interruptible, and POSIX requires it to be atomic, so once we've got to it - we've made it.
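A minimal sketch of the lease handling, with the same assumed file names (note that setting a lease requires owning the file or having the CAP_LEASE capability):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t lease_broken = 0;
static void on_lease_break(int sig) { (void)sig; lease_broken = 1; }

int write_index(void)
{
    /* The kernel delivers SIGIO (by default) when another process
       opens the index for writing, breaking our read lease. */
    signal(SIGIO, on_lease_break);
    int fd = open("index", O_RDONLY);
    if (fd < 0 || fcntl(fd, F_SETLEASE, F_RDLCK) < 0)
        return -1;
    /* ... check the state is up-to-date, write "index.tmp";
       abort as soon as lease_broken becomes non-zero ... */
    if (!lease_broken)
        rename("index.tmp", "index");   /* atomic replacement */
    fcntl(fd, F_SETLEASE, F_UNLCK);
    close(fd);
    return lease_broken ? -1 : 0;
}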
I know there are shared-memory mutexes and named semaphores (instead of advisory locking for cooperation between application instances), but I think we can all agree that they are needlessly complex for the task at hand and have their own problems.
Windows
A read operation (see the sketch after this list):
Obtain a shared lock, if requested - CreateFile (OPEN_ALWAYS, GENERIC_READ, FILE_SHARE_READ) and LockFileEx (1 byte) the "lock" file
CreateFile (OPEN_EXISTING, GENERIC_READ, FILE_SHARE_READ) the index file
Read file contents
CloseHandle the index
Unlock - CloseHandle the "lock" file
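A minimal sketch of this read operation, again assuming "index" and "index.lock" as file names:

#include <windows.h>

BOOL read_index(void)
{
    HANDLE hLock = CreateFileW(L"index.lock", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hLock == INVALID_HANDLE_VALUE) return FALSE;

    OVERLAPPED ov = {0};                  /* lock region starts at offset 0 */
    LockFileEx(hLock, 0, 0, 1, 0, &ov);   /* shared lock on 1 byte */

    HANDLE hIndex = CreateFileW(L"index", GENERIC_READ, FILE_SHARE_READ,
                                NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hIndex != INVALID_HANDLE_VALUE) {
        /* ... ReadFile the contents and parse them ... */
        CloseHandle(hIndex);
    }
    CloseHandle(hLock);                   /* also releases the lock */
    return hIndex != INVALID_HANDLE_VALUE;
}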
A write operation (the change detection is sketched below):
Obtain an exclusive lock - CreateFile (OPEN_ALWAYS, GENERIC_READ, FILE_SHARE_READ) and LockFileEx (LOCKFILE_EXCLUSIVE_LOCK, 1 byte) the "lock" file
CreateFile (OPEN_EXISTING, GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE) the index file
ReadDirectoryChangesW (FALSE, FILE_NOTIFY_CHANGE_LAST_WRITE) on the index file's directory, with an OVERLAPPED structure and an event
Check the state is up-to-date. Modify the state. Write it to a temporary file
Replace the index file with the temporary one
CloseHandle the index
Unlock - CloseHandle the "lock" file
During the modification phase, check for the event from the OVERLAPPED structure with WaitForSingleObject (zero timeout). If there are events for the index - abort the operation. Otherwise - arm the watch again, check whether we are still up-to-date, and if so - continue.
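A minimal sketch of that change detection (hDir is assumed to be the index file's directory, opened with FILE_LIST_DIRECTORY access and the FILE_FLAG_BACKUP_SEMANTICS | FILE_FLAG_OVERLAPPED flags; ov->hEvent is an event from CreateEvent):

#include <windows.h>

static DWORD g_buf[1024];  /* FILE_NOTIFY_INFORMATION records, DWORD-aligned */

BOOL start_watch(HANDLE hDir, OVERLAPPED *ov)
{
    return ReadDirectoryChangesW(hDir, g_buf, sizeof g_buf, FALSE,
                                 FILE_NOTIFY_CHANGE_LAST_WRITE,
                                 NULL, ov, NULL);
}

/* Poll with a zero timeout during the modification phase; on a hit,
   parse the FILE_NOTIFY_INFORMATION records in g_buf and abort if
   any of them refer to the index file. */
BOOL change_signalled(OVERLAPPED *ov)
{
    return WaitForSingleObject(ov->hEvent, 0) == WAIT_OBJECT_0;
}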
Remarks
The Windows version uses locking instead of the Linux version's notification mechanism, which may interfere with outside processes making writes, but there is seemingly no other way on Windows.
In Linux, you can also use mandatory file locking.
See "Semantics" section:
If a process has locked a region of a file with a mandatory read lock, then other processes are permitted to read from that region. If any of these processes attempts to write to the region it will block until the lock is released, unless the process has opened the file with the O_NONBLOCK flag in which case the system call will return immediately with the error status EAGAIN.
and:
If a process has locked a region of a file with a mandatory write lock, all attempts to read or write to that region block until the lock is released, unless a process has opened the file with the O_NONBLOCK flag in which case the system call will return immediately with the error status EAGAIN.
With this approach, the APP may set a read or write lock on the file, and the VCS will be blocked until the lock is released.
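A minimal sketch of setting this up (the filesystem must be mounted with the "mand" option; also note that mandatory locking has long been deprecated and was removed in Linux 5.15):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int lock_index_mandatory(void)
{
    /* Mandatory locking is enabled per file: set-group-ID bit on,
       group-execute bit off. */
    if (chmod("index", 0640 | S_ISGID) < 0)
        return -1;
    int fd = open("index", O_RDWR);
    if (fd < 0)
        return -1;
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };  /* whole file */
    return fcntl(fd, F_SETLKW, &fl); /* other processes' reads/writes now block */
}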
Note that neither mandatory locks nor file leases will work well if the VCS can unlink() the index file or replace it using rename():
If you use mandatory locks, the VCS won't be blocked.
If you use file leases, the APP won't get a notification.
You also can't establish locks or leases on a directory. What you can do in this case:
After a read operation, the APP can manually check that the file still exists and has the same inode, as sketched below.
But that's not enough for write operations. Since the APP can't atomically check the file's inode and modify the file, it can accidentally overwrite changes made by the VCS without being able to detect it. You can probably detect this situation using inotify(7).
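A minimal sketch of that manual inode check:

#include <sys/stat.h>

/* Returns non-zero if the path still refers to the same inode we have
   open, i.e. the file was not unlinked or replaced via rename. */
int same_file(int fd, const char *path)
{
    struct stat by_fd, by_path;
    if (fstat(fd, &by_fd) < 0 || stat(path, &by_path) < 0)
        return 0;
    return by_fd.st_dev == by_path.st_dev && by_fd.st_ino == by_path.st_ino;
}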
We have a project where multiple nodes write data to a file in sequence, and the file resides on NFS.
We were using synchronous NFS before, so flushing the file streams just worked fine. Now we have asynchronous NFS and it's not working - not working in the sense that, obviously, caching comes into the picture and other nodes don't see the changes made by a particular node.
I wanted to know if there is a way to forcefully flush the data from the cache to disk. I know this is not efficient, but it will get things working until we get the real solution in place.
I've had a similar problem using NFS with VxWorks. After some experimentation I've found a way to reliably flush data to the device:
int fd;
fd = open("/ata0a/test.dat", O_RDWR | O_CREAT, 0644); /* O_CREAT needs a mode */
write(fd, "Hallo", 5);
/* data is having a great time in some buffers... */
ioctl(fd, FIOSYNC, 0); /* <-- may last quite a while... */
/* data is flushed to file */
close(fd);
I've never worked with ofstreams, nor do I know if your OS provides something similar to the code shown above...
But one thing to try is simply closing the file. This will cause all buffers to be flushed. Be aware, though, that there may be some time between closing the file and all the data actually being flushed, which your application does not see, since the close call may return before the data is written. Additionally, this creates a lot of overhead, since you have to re-open the file afterwards.
If this is not an option, you can also write enough "dummy data" after your real data to fill up the buffers. This will also result in the data being written to the file, but it may waste a lot of disk space, depending on the size of your data.
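For what it's worth, on a POSIX-like system the closest analog to the FIOSYNC ioctl above would be flushing the stream and then calling fsync(2) on the underlying descriptor - a sketch under that assumption (whether the NFS client actually pushes the data to the server also depends on the mount options):

#include <stdio.h>
#include <unistd.h>

int flush_to_storage(FILE *f)
{
    if (fflush(f) != 0)        /* stdio buffers -> kernel */
        return -1;
    return fsync(fileno(f));   /* kernel cache -> server/disk */
}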
I'm working on an ACID database software product and I have some questions about file durability on Windows.
CreateFile has two flags, FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING - do I need both to achieve file durability (i.e. to override all kinds of disk or OS file caching)? I'm asking because they seem to do the same thing, and setting FILE_FLAG_NO_BUFFERING causes WriteFile to fail with ERROR_INVALID_PARAMETER.
FILE_FLAG_NO_BUFFERING specifies no caching at all - no read or write cache; all data goes directly between your application and the disk. This is mostly useful if you read chunks so large that caching is useless, or if you do your own caching. Note WhozCraig's comment on properly aligning your data when using this flag.
FILE_FLAG_WRITE_THROUGH only means that writes should be written directly to disk before the function returns. This is enough to achieve ACID, while it still gives the OS the option to cache data from the file.
Using FlushFileBuffers() can provide a more efficient approach to achieving ACID, as you can do several writes to a file and then flush them in one go. Combining writes into one flush is very important, as non-cached writes will limit you to the spindle speed of your hard drive: at most 120 non-cached writes or flushes per second for a 7200 rpm disk.
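A minimal sketch of that batched pattern (the handle is assumed to be opened without FILE_FLAG_WRITE_THROUGH or FILE_FLAG_NO_BUFFERING):

#include <windows.h>

BOOL write_batch_durable(HANDLE h, const void *recs[], const DWORD sizes[], int n)
{
    DWORD written;
    for (int i = 0; i < n; i++)
        if (!WriteFile(h, recs[i], sizes[i], &written, NULL))
            return FALSE;       /* these writes land in the OS file cache */
    return FlushFileBuffers(h); /* one flush makes the whole batch durable */
}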
I have a question about the buffering in the standard I/O library:
I read "The Linux Programming Interface", chapter 13, about file I/O buffering; the author mentions that the standard library uses I/O buffering for disk files and the terminal.
My question is: does this I/O buffering also apply to FIFOs, pipes, sockets, and network files?
Yes, if you're using the FILE *-based standard I/O library. The only odd thing that might happen is if the underlying system file descriptor returns non-zero from the isatty function; then stdio might "line buffer" both input and output, meaning it tends to flush when it sees a '\n'.
I believe stdout is required to be line-buffered if file descriptor 1 returns non-zero from isatty.
No. Anything that's an ordinary file descriptor (such as those returned by open(2), pipe(2), socket(2), and accept(2)) is not buffered - any data you read or write to it is input or output immediately via direct system calls.
Buffering only happens when you have FILE* objects, which you can get by fopen(3)'ing a regular disk file; the stdin, stdout, and stderr objects are also FILE* objects that are set up at program start. Buffering is usually enabled on FILE* objects, but not always - it can be disabled with setbuf(3), and stderr is unbuffered by default.
If you want to create a buffered stream out of a regular file descriptor, you can do so with fdopen(3).
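For example, a minimal sketch wrapping an arbitrary descriptor (here a hypothetical socket fd) in a buffered stream:

#include <stdio.h>

FILE *buffered_stream(int fd)
{
    FILE *f = fdopen(fd, "r+");          /* fd must be open for read/write */
    if (f != NULL)
        setvbuf(f, NULL, _IOFBF, 8192);  /* fully buffered, 8 KiB buffer */
    return f;
}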
When I run this program
OVERLAPPED o;
int main()
{
    ..
    CreateIoCompletionPort(....);
    for (int i = 0; i < 10; i++)
    {
        WriteFile(.., &o);
        OVERLAPPED* po;
        GetQueuedCompletionStatus(.., &po);
    }
}
it seems that WriteFile doesn't return until the writing job is done, and only then does the GetQueuedCompletionStatus() call complete. The behavior is like a synchronous I/O operation rather than an asynchronous one.
Why is that?
If the file handle and volume have write caching enabled, the file operation may complete with just a memory copy to cache, to be flushed lazily later. Since there is no actual IO taking place, there's no reason to do async IO in that case.
Internally, each IO operation is represented by an IRP (IO request packet). It is created by the kernel and given to the filesystem to handle the request, where it passes down through layered drivers until the request becomes an actual disk controller command. That driver will make the request, mark the IRP as pending and return control of the thread. If the handle was opened for overlapped IO, the kernel gives control back to your program immediately. Otherwise, the kernel will wait for the IRP to complete before returning.
Not all IO operations make it all the way to the disk, however. The filesystem may determine that the write should be cached, and not written until later. There is even a special path for operations that can be satisfied entirely using the cache, called fast IO. Even if you make an asynchronous request, fast IO is always synchronous because it's just copying data into and out of cache.
Process Monitor, in advanced output mode, displays the different modes and will show a blank status field while an IRP is pending.
There is a limit to how much data is allowed to be outstanding in the write cache. Once it fills up, write operations will not complete immediately. Try writing a lot of data at once, with many operations.
I wrote a blog posting a while back entitled "When are asynchronous file writes not asynchronous" and the answer was, unfortunately, "most of the time". See the posting here: http://www.lenholgate.com/blog/2008/02/when-are-asynchronous-file-writes-not-asynchronous.html
The gist of it is:
For security reasons, Windows extends files in a synchronous manner.
You can attempt to work around this by setting the end of the file to a large value before you start and then trimming the file to the correct size when you finish (sketched below).
You can tell the cache manager to use your buffers and not its own by using FILE_FLAG_NO_BUFFERING.
At least it's not as bad as being forced to use FILE_FLAG_WRITE_THROUGH.
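A minimal sketch of that pre-extension trick (note that fully avoiding synchronous zero-filling of the extended region may additionally require SetFileValidData, which needs the SE_MANAGE_VOLUME_NAME privilege):

#include <windows.h>

/* Pre-extend the file before issuing overlapped writes; when finished,
   seek to the real size and call SetEndOfFile again to trim. */
BOOL preextend(HANDLE h, LONGLONG bytes)
{
    LARGE_INTEGER size;
    size.QuadPart = bytes;
    if (!SetFilePointerEx(h, size, NULL, FILE_BEGIN))
        return FALSE;
    return SetEndOfFile(h);
}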
If GetQueuedCompletionStatus is being called, then the call to WriteFile has already returned (so it appears synchronous), but the system can still modify &o even after WriteFile has returned, if the operation is actually asynchronous.
from this page in MSDN:
For asynchronous write operations, hFile can be any handle opened with the CreateFile function using the FILE_FLAG_OVERLAPPED flag or a socket handle returned by the socket or accept function.
also, from this page:
If a handle is provided, it has to have been opened for overlapped I/O completion. For example, you must specify the FILE_FLAG_OVERLAPPED flag when using the CreateFile function to obtain the handle.