File modification time gets overwritten by background cache flushing - c++

I have code that performs the following steps:
1. open file
2. write data
3. set file timestamps (via SetFileInformationByHandle(FileBasicInfo))
4. close file
When the file is stored on certain NAS devices (and accessed via a share), its modification time ends up being set to the current time.
According to Process Monitor, the Close() in step 4 results in a Write (the local cache gets flushed/pushed to the NAS device) that (seemingly) updates the file's mtime on the server.
If I add FlushFileBuffers() (or sleep for a few seconds) between steps 2 and 3, everything is fine.
Is this a bug in the SMB implementation of this NAS device (Dell EMC Isilon), or did SetFileInformationByHandle() never promise anything?
What is the best way to deal with this situation? I would really like to avoid having to call FlushFileBuffers()...
Edit: Great... :-/ It looks like for executables (and only executables) atime (last access time) gets screwed up too (in the same way). Only these are harder to reproduce -- I need to run this logic a few times. Could be some antivirus... I am still investigating.
Edit 2: According to procmon access time gets updated by EXPLORER.EXE -- when it sees an executable, it can't resist opening it and reading portions of it (probably extracting the icon).

You can't really do anything -- I guess Isilon's SMB implementation doesn't support certain things (that would've preserved timestamps).
I simply added FlushFileBuffers() before SetFileInformationByHandle() and made sure there are no related race conditions in my code.
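The ordering described above (write, flush, set the timestamp, close) can be sketched roughly as follows. This is only an illustration under assumptions: the function name `write_then_stamp` and its parameters are hypothetical, error handling is minimal, and the non-Windows branch uses `fsync()` and `futimens()` as an analogous stand-in so the sketch also builds on POSIX systems. It has not been tested against an Isilon share.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>
#else
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

// Hypothetical helper: writes `data` to `path`, then sets the modification
// time to `mtime_unix` (seconds since 1970-01-01 UTC).
// The flush happens BEFORE the timestamp is set, so no cached write is
// left to be pushed out (and bump mtime) when the handle is closed.
bool write_then_stamp(const std::string& path, const std::string& data,
                      std::int64_t mtime_unix) {
#ifdef _WIN32
    HANDLE h = CreateFileA(path.c_str(), GENERIC_WRITE, 0, nullptr,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;
    DWORD written = 0;
    bool ok = WriteFile(h, data.data(), static_cast<DWORD>(data.size()),
                        &written, nullptr) && written == data.size();
    ok = ok && FlushFileBuffers(h);  // push cached data to the server first
    if (ok) {
        FILE_BASIC_INFO info{};  // zeroed fields mean "leave unchanged"
        // FILETIME counts 100 ns ticks since 1601-01-01.
        info.LastWriteTime.QuadPart = (mtime_unix + 11644473600LL) * 10000000LL;
        ok = SetFileInformationByHandle(h, FileBasicInfo, &info, sizeof(info)) != 0;
    }
    CloseHandle(h);
    return ok;
#else
    int fd = open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    bool ok = write(fd, data.data(), data.size()) ==
              static_cast<ssize_t>(data.size());
    ok = ok && fsync(fd) == 0;  // POSIX stand-in for FlushFileBuffers
    if (ok) {
        timespec times[2];
        times[0].tv_sec = 0;
        times[0].tv_nsec = UTIME_OMIT;  // keep the access time as it is
        times[1].tv_sec = static_cast<time_t>(mtime_unix);
        times[1].tv_nsec = 0;
        ok = futimens(fd, times) == 0;
    }
    close(fd);
    return ok;
#endif
}
```

A call such as `write_then_stamp("out.bin", payload, 1000000000)` would leave the file with that modification time, provided nothing else writes to it afterwards.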

Related

Manipulating large files that are being used

I want to retrieve handles of very large files (hundreds of GB to x TB) that are being used by some processes. I am thinking about stopping the running processes for a while and then copying the files to some specific location. But this approach looks clumsy for a couple of reasons:
The files are too large, so copying them from place to place takes time on different disk types.
After the file copy is done, I have to restart the stopped processes. But what if my users need to load/copy other large files whose handles those processes are controlling? I would have to stop them again. I don't want to do this because they have to do many critical tasks on my machine.
So I have 2 questions:
Please explain whether my approach is wrong. What I describe above is only my personal idea; no coding has been done yet for any of it.
Are there any methods to clone ~50 large files (1-5 TB) fast (within some ten seconds or so) and silently in the background?
If the original process opens the file in non-sharing mode (seems likely for such large files) you are probably out of luck for doing this without closing that process - or at least getting it to relinquish the file. If it allows at least read sharing I would suggest you use transactional NTFS - despite the warnings on all the docs about that.
Create a KTM transaction manager (via CreateTransactionManager), create a KTM transaction (with CreateTransaction) then use CopyFileTransacted to do the actual copy. Finally commit the transaction (CommitTransaction) and then close all the handles. Doing it this way will ensure that the file is in a consistent state (no partial writes from the original process).
It may also be that the backup API can ignore share mode (I know it can ignore security checks if your process has the appropriate privileges enabled, not sure about sharing).

Robust way to detect if file has changed

I think this question hasn't been answered for my use-case.
We wish to detect if the user has changed a file without re-reading its contents for the purposes of caching a computation result based on the file contents. Our program is a long-running one that lets the user click a button to perform a computation based on data entered in the program and data stored in external files (sorry, I can't be more specific than that). The external data needs to be read, processed and various data structures need to be built based on it, so we try to cache those between computations to speed up re-computes when the user changes the data in the program itself, but not the data in the external files. However, if the external file has changed, we have to re-read that.
For each external resource we're checking if the modification time and file size have changed, but that's not really all that robust and can lead to user frustration if they have e.g. fileA and fileB with the same size and timestamp, copy fileA to fileC, use fileC as an external resource, and then copy fileB to fileC. The system preserves the modification time of the original file and the sizes are the same, so we don't re-read the external resource.
Our program runs on Windows, macOS and Linux, is written in C++ and we're perfectly OK with using platform-specific code to detect file changes. We're interested in the most robust way to detect if the contents of a file identified by a file path have changed without actually reading the file itself.
I've made this answer a community wiki so others can add their ideas for the various platforms listed in the question.
Linux
MacOS
Windows
Option 1
Set up a thread that watches the directory containing the file. When the directory changes, you'll have to check if the file you care about has actually changed. That may mean opening and re-reading the file, (e.g., to compute the current checksum). But since you have to do this only after a change notification, this overhead may be acceptable.
I believe (but have not verified) that if someone copies a same-size, same-timestamp file over an existing file, you'll get a directory change notification.
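The re-check step in Option 1 (re-reading the file after a change notification to compute a checksum) might look something like the sketch below. The function name `file_checksum` is hypothetical, and FNV-1a is used only because it is short; it is not a cryptographic hash, and any stronger hash could be swapped in.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: computes a 64-bit FNV-1a hash of the file contents.
// Returns false if the file cannot be opened; otherwise stores the hash in `out`.
bool file_checksum(const std::string& path, std::uint64_t& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    std::uint64_t hash = 14695981039346656037ULL;  // FNV-1a offset basis
    char buf[4096];
    // read() fails on the final short chunk, so gcount() is checked as well.
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        for (std::streamsize i = 0; i < in.gcount(); ++i) {
            hash ^= static_cast<unsigned char>(buf[i]);
            hash *= 1099511628211ULL;  // FNV-1a 64-bit prime
        }
    }
    out = hash;
    return true;
}
```

The program would store the checksum next to the cached computation result and compare it with a freshly computed one whenever the directory watcher fires; the cache is rebuilt only if the two differ.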
Option 2
Hold the file open with an opportunistic lock. This involves creating the lock with a call to DeviceIoControl and then issuing a blocking call to GetOverlappedResult, which will unblock when another process attempts to change the file. Your program can then release the lock, allowing the other process to update the file, and know that the file is being changed.

Is it more efficient to rewind a file than closing it and opening it up again?

I'm writing a little C++ program for myself. At the beginning of it, I read a file all the way to the bottom, and later on, right before the program ends, I need to read that file again from the beginning.
My question is, is it more efficient to keep the file open during the execution (even though I won't be using it) and just rewind it when I need it again, or should I close it the first time and then open it again when I need it?
Edit: Just to clarify, my question is not only related to the specific project that I'm working on. It is really small (less than 300 lines of code), so there won't be any noticeable performance difference. I'm asking about opening, closing and "rewinding" files in general, so it's applicable to other big projects where performance and memory may actually matter.
If you close and reopen the file, the OS definitely needs to update the system lock for the file and the list of resources (open files) of your process. Furthermore, the close and open operations are two system calls (kernel calls), and a system call is not cheap. Every system call requires translating virtual addresses.
Closing the file can (if there is any change) force the cache to be written to the hard disk, which means a seek time of about 15 ms (physical movement of the disk head). It can be even worse in the case of a network drive.
After closing the file, some properties need to be updated. A file system watcher may be triggered.
An antivirus scan may be triggered after closing the file, depending on the filename, path and antivirus brand.
Furthermore, closing the file carries a risk that you will not be able to open it again because another process has it open. For example, Dropbox reads every file in the Dropbox folder after a change. So closing and reopening a file does not generally work in a Dropbox folder (Dropbox may be faster). And who knows how users will use your application. Users are inventive and they share files in ways you didn't think of.
You might be able to measure a fraction of gained efficiency in the range of a few nanoseconds if you fseek to the beginning of the file but I don't think this is worth it when you are only dealing with a single file.
Like others said: try to find other areas of code which you can optimize.
As with all performance issues, the final optimizations vary widely. Measure both implementations against a reasonable data set and take it from there.
As a design choice it may be simpler to cache the contents of the file in memory once it has been read the first time and then there is no need to re-read the contents. If the modified content is required then again, cache the modified data to forgo the second read.
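For reference, rewinding an open stream instead of closing and reopening it could be sketched as below. The helper names `count_lines` and `rewind_stream` are made up for illustration. The one detail that tends to trip people up is that the stream's error state has to be cleared after reading to the end.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <string>

// Reads the stream to end of file and returns how many lines were seen.
std::size_t count_lines(std::istream& in) {
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line)) ++n;
    return n;
}

// Rewinds a stream that has already been read to the end.
// clear() must come first: the failbit left by the last failed read
// would otherwise make seekg() do nothing.
bool rewind_stream(std::istream& in) {
    in.clear();
    in.seekg(0, std::ios::beg);
    return static_cast<bool>(in);
}
```

With a `std::ifstream in("data.txt")`, calling `count_lines(in)`, then `rewind_stream(in)`, then `count_lines(in)` again reads the same file twice through a single open handle.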

File corruption detection and error handling

I'm a newbie C++ developer and I'm working on an application which needs to write out a log file every so often, and we've noticed that the log file has been corrupted a few times when running the app. The main scenarios seem to be when the program is shutting down or crashes, but I'm concerned that this isn't the only time that something may go wrong, as the application was born out of a fairly "quick and dirty" project.
It's not critical to have the absolute most up-to-date data saved, so one idea that someone mentioned was to alternately write to two log files, so that if the program crashes at least one will still have proper integrity. But this doesn't smell right to me, as I haven't really seen any other application use this method.
Are there any "best practises" or standard "patterns" or frameworks to deal with this problem?
At the moment I'm thinking of doing something like this -
Write data to a temp file
Check the data was written correctly with a hash
Rename the original file, and put the temp file in place.
Delete the original
Then if anything fails I can just roll back by deleting the temp file, and the original will be untouched.
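A rough C++17 sketch of the plan above is shown here, with two simplifications that should be read as assumptions: the hash check is replaced by reading the temp file back and comparing it with the intended contents, and the temp file is renamed directly over the original instead of renaming the original out of the way first. The function name `replace_file` and the `.tmp` suffix are hypothetical.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: replaces `target` with `contents` via a temp file.
// On any failure the temp file is deleted and the original stays untouched.
bool replace_file(const fs::path& target, const std::string& contents) {
    fs::path tmp = target;
    tmp += ".tmp";  // assumed naming scheme for the temp file
    std::error_code ec;

    {   // Step 1: write the data to the temp file.
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out << contents;
        out.flush();
        if (!out) {
            out.close();
            fs::remove(tmp, ec);
            return false;
        }
    }
    {   // Step 2: verify by reading back (stand-in for the hash check).
        std::ifstream in(tmp, std::ios::binary);
        std::string readback((std::istreambuf_iterator<char>(in)),
                             std::istreambuf_iterator<char>());
        if (readback != contents) {
            in.close();
            fs::remove(tmp, ec);
            return false;
        }
    }
    // Step 3: put the temp file in place of the original.
    fs::rename(tmp, target, ec);
    if (ec) {
        std::error_code ignored;
        fs::remove(tmp, ignored);
        return false;
    }
    return true;
}
```

Note that this sketch does not force the data to disk before the rename, so on its own it guards against a crashing application more than against a system crash or power loss.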
You must find the reason why the file gets corrupted. If the app crashes unexpectedly, it can't corrupt the file. The only thing that can happen is that the file is truncated (i.e. the last log messages are missing). But the app can't really jump around in the file and modify something elsewhere (unless you call seek in the logging code which would surprise me).
My guess is that the app is multi-threaded and the logging code is being called from several threads, which can easily lead to the data being corrupted before it is written to the log.
You probably forgot to call fsync() every so often, or the data comes in from different threads without proper synchronization among them. Hard to tell without more information (platform, form of corruption you see).
A workaround would be to use logfile rollover, ie. starting a new file every so often.
I really think that you (and others) are wasting your time when you start adding complexity to log files. The whole point of a log is that it should be simple to use and implement, and should work most of the time. To that end, just write the log to an unbuffered stream (like cerr in a C++ program) and live with any, very occasional in my experience, snafus.
OTOH, if you really need an audit trail of everything your app does, for legal reasons, then you should be using some form of transactional storage such as a SQL database.
Not sure if your app is multi-threaded -- if so, consider using the Active Object pattern to put a queue in front of the log and make all writes within a single thread. That thread can commit the log in the background. All log writes will be asynchronous, and in order, but not necessarily written immediately.
The active object can also batch writes.
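A minimal sketch of such an active-object logger is given below, assuming C++11 threads are available. The class name `ActiveLogger` and its interface are invented for illustration and are not taken from any particular library. Callers only push lines onto a queue; a single worker thread drains the queue in batches and is the only code that touches the file.

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical active-object logger: log() is cheap and thread-safe,
// all file writes happen on one background thread, in order.
class ActiveLogger {
public:
    explicit ActiveLogger(const std::string& path)
        : out_(path, std::ios::app), worker_([this] { run(); }) {}

    ~ActiveLogger() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // the worker drains the remaining lines before exiting
    }

    void log(std::string line) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(line));
        }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
            if (queue_.empty()) return;  // only reachable when done_ is set
            std::queue<std::string> batch;
            batch.swap(queue_);          // take the whole batch at once
            lock.unlock();               // write without blocking callers
            while (!batch.empty()) {
                out_ << batch.front() << '\n';
                batch.pop();
            }
            out_.flush();
            lock.lock();
        }
    }

    std::ofstream out_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::string> queue_;
    bool done_ = false;
    std::thread worker_;  // declared last so it starts after the other members
};
```

Typical use would be one long-lived `ActiveLogger logger("app.log");` shared by all threads, each calling `logger.log("message")`. Lines still in the queue are lost if the process crashes, which matches the trade-off described above.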

Ensuring a file is flushed when file created in external process (Win32)

Windows Win32 C++ question about flushing file activity to disk.
I have an external application (run using CreateProcess) which does some file creation, i.e., when it returns it will have created a file with some content.
How can I ensure that the file the process created was really flushed to disk, before I proceed?
By this I mean not the C++ buffers but really flushing disk (e.g. FlushFileBuffers).
Remember that I don't have access to any file HANDLE - this is all of course hidden inside the external process.
I guess I could open up a handle of my own to the file and then use FlushFileBuffers, but it's not clear this would work (since my handle doesn't actually contain anything which needs flushing).
Finally, I want this to run in non-admin userspace so I cannot use FlushFileBuffers on a whole volume.
Any ideas?
UPDATE: Why do I think this is a problem?
I'm working on a data backup application. Essentially it has to create some files as described. It then has to update its internal DB (using SQLite embedded DB).
I recently had a data corruption issue which occurred during a bluescreen (the cause of which was unrelated to my app).
What I'm concerned about is application integrity during a system crash. And yes, I do care about this because this app is a data backup app.
The use case I'm concerned about is this:
1. A small data file is created using an external process. This write is waiting in the OS cache to be written to disk.
2. I update the DB and commit. This is a disk activity. This write is also waiting in the OS cache.
3. A system failure occurs.
As I see it, we're now in a potential race condition. If "1" gets flushed and "2" doesn't then we're fine (as the DB transact wasn't then committed). If neither gets flushed or both get flushed then we're also OK.
As I understand it, the writes will be non-deterministic. i.e., I'm not aware that the OS will guarantee to write "1" before "2". (Am I wrong?)
So, if "2" gets flushed, but "1" doesn't then we have a problem.
What I observed was that the DB was correctly updated, but that the file had garbage in: the last 2 thirds of the data was binary "zeroes". Now, I don't know what it looks like when you have a file part flushed at the time of bluescreen, but I wouldn't be surprised if it looked like that.
Can I guarantee this is the cause? No I cannot guarantee this. I'm just speculating. It could just be that the file was "naturally" corrupted due to disk failure or as a result of the blue screen.
With regards to performance, this is something I believe I can deal with.
For example, the default behaviour of SQLite is to do a full file flush (using FlushFileBuffers) every time you commit a transaction. They are quite clear that if you don't do this then at the time of system crash, you might have a corrupted DB.
Also, I believe I can mitigate the performance hit by only flushing at "checkpoints". For example, writing 50 files, flushing the lot and then writing to the DB.
How likely is all this to be a problem? Beats me. But then my app might well be archiving at or around the time of system failure, so it might be more likely than you think.
Hope that explains why I want to do this.
Why would you want this? The OS will make sure that the data is flushed to the disk in due time. If you access it, it will either return the data from the cache or from disk, so this is transparent for you.
If you need some safety in case of disaster, then you must call FlushFileBuffers, for example by creating a process with admin rights after running the external process. But that can severely impact the performance of the whole machine.
Your only other option is to modify the source of the other process.
[EDIT] The most simple solution is probably to copy the file in your process and then flush the copy (since you have the handle). Save the copy under a name which says "not committed in the database".
Then update the database. Write into the database, "updated from file ...". If this entry already exists next time, don't update the database and skip this step.
Flush the database to disk.
Rename the file to "file has been processed into database". Rename is an atomic operation (so it either happens or not).
If you can't think of a good filename for the different states, then use subfolders and move the file between them.
Well, there are no attractive options here. There is no documented way to retrieve the file handle you need from the process. Although there are undocumented ones, go there (via DuplicateHandle) only with careful consideration.
Yes, calling FlushFileBuffers on a volume handle is the documented way. You can avoid the privilege problem by letting a service make the call. Talk to it from your app with one of the standard process interop mechanisms. A named pipe whose name is prefixed with Global\ is probably the easiest way to get that going.
After your update I think http://sqlite.org/atomiccommit.html gives you the answers you need.
The way SQLite ensures that everything is flushed to disc works. So it works for you as well - take a look at the source.