Using FILE_ATTRIBUTE_TEMPORARY and FILE_FLAG_DELETE_ON_CLOSE - c++

I am using the two flags FILE_ATTRIBUTE_TEMPORARY and FILE_FLAG_DELETE_ON_CLOSE while creating temporary files in my C++ application.
According to this blog, there shouldn't be any file created on the disk:
It’s only temporary
Larry Osterman, April 19, 2004
To create a “temporary” file, you call CreateFile specifying FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE in the dwFlagsAndAttributes attribute. This combination of bits acts as a hint to the filesystem that the file data should never be flushed to disk. In other words, such a file can be created, written to, and read from without the system ever touching the disk.
But in my code the file is created and written to on disk (even for 1 KB data). Can someone confirm the exact functionality of these flags, and whether the files are created on disk or not?

Later on in that same link, there is the quote:
If you exceed available memory, the memory manager will flush the file data to disk. This causes a performance hit, but your operation will succeed instead of failing.
Marking a file as temporary will tell the system it doesn't need to be on disk, but it doesn't prevent it from being put there, either.

It just says that the file's data should never be flushed to disk while memory allows. That means that while it exists in your filesystem, it may never be physically stored on your hard drive. The file system will show it though, with its actual size and all.
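For reference, a minimal sketch of the flag combination in question (the path is hypothetical); it illustrates how such a file behaves, not a guaranteed in-memory file:

    // Minimal sketch of FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE.
    #include <windows.h>

    int main() {
        HANDLE h = CreateFileW(L"C:\\Temp\\scratch.tmp",   // hypothetical path
                               GENERIC_READ | GENERIC_WRITE,
                               0, nullptr, CREATE_ALWAYS,
                               FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
                               nullptr);
        if (h == INVALID_HANDLE_VALUE) return 1;

        // While the handle is open, the file is visible in the filesystem; the
        // cache manager tries to keep its pages in memory, but under memory
        // pressure the data can still be written to disk.
        const char data[] = "temporary payload";
        DWORD written = 0;
        WriteFile(h, data, sizeof(data), &written, nullptr);

        CloseHandle(h);   // last handle closed: the file is deleted here
        return 0;
    }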

Related

How to write to a file in a recoverable (transactional) way?

I have files on disk that contain critical information that needs to be updated every so often. It's safe to assume that only one process accesses that file, that the app is running on Windows and on an NTFS file system.
My goal is to prevent possible data corruption (partial write, for example) in case of program crashes, power outage, etc. In this question, I'm not concerned with disk sector corruption.
This is my current approach:
On write (a sketch of this side follows after these lists):
1. Write new data in a temporary file.
2. Create a voucher file. (This file guarantees that the temporary file has valid contents.)
3. Write new data in the target file.
4. Delete the voucher file.
5. Delete the temporary file.
On read:
1. If the voucher file exists: copy the temporary file to the target file, then delete the voucher file.
2. Read the target file.
3. Delete the temporary file, if it exists.
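For concreteness, a minimal sketch of the write side; the file names are hypothetical and error handling is abbreviated. Durability of each step still depends on the flushing questions below.

    #include <windows.h>

    static bool WriteWhole(const wchar_t* path, const void* data, DWORD size) {
        HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS,
                               FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h == INVALID_HANDLE_VALUE) return false;
        DWORD written = 0;
        BOOL ok = WriteFile(h, data, size, &written, nullptr) && written == size;
        ok = ok && FlushFileBuffers(h);        // force the data out of the OS cache
        CloseHandle(h);
        return ok != FALSE;
    }

    bool SaveWithVoucher(const void* data, DWORD size) {
        if (!WriteWhole(L"data.bin.tmp", data, size)) return false;    // 1. temp file
        if (!WriteWhole(L"data.bin.voucher", "ok", 2)) return false;   // 2. voucher
        if (!WriteWhole(L"data.bin", data, size)) return false;        // 3. target
        DeleteFileW(L"data.bin.voucher");                              // 4.
        DeleteFileW(L"data.bin.tmp");                                  // 5.
        return true;
    }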
Questions:
With my current approach, if I'm overwriting only a few bytes of the file, is it possible for other bytes to erroneously change as a result of some failure?
With my current approach, is it necessary to disable file caching, or does Windows reliably recover the cache in case of a power outage?
Is there a faster approach that uses some knowledge of how either Windows or NTFS works? If so, I would also be interested in any differences on FAT file systems.
I'm aware of the transactional NTFS API in Win32, but I don't want to use it since it's deprecated.

Robust way to detect if file has changed

I think this question hasn't been answered for my use-case.
We wish to detect if the user has changed a file without re-reading its contents for the purposes of caching a computation result based on the file contents. Our program is a long-running one that lets the user click a button to perform a computation based on data entered in the program and data stored in external files (sorry, I can't be more specific than that). The external data needs to be read, processed and various data structures need to be built based on it, so we try to cache those between computations to speed up re-computes when the user changes the data in the program itself, but not the data in the external files. However, if the external file has changed, we have to re-read that.
For each external resource we're checking whether the modification time and file size have changed, but that's not really all that robust and can lead to user frustration if they have, e.g., fileA and fileB with the same size and timestamp, copy fileA to fileC, use fileC as an external resource, and then copy fileB to fileC. The system preserves the modification time of the original file and the sizes are the same, so we don't re-read the external resource.
Our program runs on Windows, macOS and Linux, is written in C++ and we're perfectly OK with using platform-specific code to detect file changes. We're interested in the most robust way to detect if the contents of a file identified by a file path have changed without actually reading the file itself.
I've made this answer a community wiki so others can add their ideas for the various platforms listed in the question.
Linux
macOS
Windows
Option 1
Set up a thread that watches the directory containing the file. When the directory changes, you'll have to check if the file you care about has actually changed. That may mean opening and re-reading the file, (e.g., to compute the current checksum). But since you have to do this only after a change notification, this overhead may be acceptable.
I believe (but have not verified) that if someone copies a same-size, same-timestamp file over an existing file, you'll get a directory change notification.
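A minimal sketch of Option 1 using ReadDirectoryChangesW; "C:\data" is a hypothetical directory, and a real implementation would run this loop on a dedicated thread:

    #include <windows.h>
    #include <cstdio>

    void WatchDirectory() {
        HANDLE hDir = CreateFileW(L"C:\\data", FILE_LIST_DIRECTORY,
            FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
            nullptr, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, nullptr);
        if (hDir == INVALID_HANDLE_VALUE) return;

        alignas(DWORD) BYTE buffer[64 * 1024];
        DWORD bytes = 0;
        while (ReadDirectoryChangesW(hDir, buffer, sizeof(buffer), FALSE,
                FILE_NOTIFY_CHANGE_FILE_NAME | FILE_NOTIFY_CHANGE_SIZE |
                FILE_NOTIFY_CHANGE_LAST_WRITE,
                &bytes, nullptr, nullptr)) {    // synchronous: blocks until a change
            auto* p = reinterpret_cast<FILE_NOTIFY_INFORMATION*>(buffer);
            for (;;) {
                // Check whether the changed name is the file we care about,
                // then re-read / re-checksum it as described above.
                wprintf(L"changed: %.*s\n",
                        static_cast<int>(p->FileNameLength / sizeof(WCHAR)),
                        p->FileName);
                if (p->NextEntryOffset == 0) break;
                p = reinterpret_cast<FILE_NOTIFY_INFORMATION*>(
                        reinterpret_cast<BYTE*>(p) + p->NextEntryOffset);
            }
        }
        CloseHandle(hDir);
    }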
Option 2
Hold the file open with an opportunistic lock. This involves creating the lock with a call to DeviceIoControl and then issuing a blocking call to GetOverlappedResult, which will unblock when another process attempts to change the file. Your program can then release the lock, allowing the other process to update the file, and it will know that the file is being changed.
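A rough sketch of Option 2, assuming Windows 7 or later (FSCTL_REQUEST_OPLOCK); the file name is hypothetical and error handling is omitted:

    // Request a read/handle-caching oplock and wait for it to break.
    // Requires _WIN32_WINNT >= 0x0601 for FSCTL_REQUEST_OPLOCK.
    #include <windows.h>
    #include <winioctl.h>

    void WatchFile() {
        HANDLE hFile = CreateFileW(L"data.bin",             // hypothetical name
            GENERIC_READ,
            FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
            nullptr, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
        if (hFile == INVALID_HANDLE_VALUE) return;

        REQUEST_OPLOCK_INPUT_BUFFER in = {};
        in.StructureVersion = REQUEST_OPLOCK_CURRENT_VERSION;
        in.StructureLength = sizeof(in);
        in.RequestedOplockLevel = OPLOCK_LEVEL_CACHE_READ | OPLOCK_LEVEL_CACHE_HANDLE;
        in.Flags = REQUEST_OPLOCK_INPUT_FLAG_REQUEST;

        REQUEST_OPLOCK_OUTPUT_BUFFER out = {};
        out.StructureVersion = REQUEST_OPLOCK_CURRENT_VERSION;
        out.StructureLength = sizeof(out);

        OVERLAPPED ov = {};
        ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);

        // The IOCTL stays pending for as long as we hold the oplock.
        DeviceIoControl(hFile, FSCTL_REQUEST_OPLOCK, &in, sizeof(in),
                        &out, sizeof(out), nullptr, &ov);

        DWORD bytes = 0;
        GetOverlappedResult(hFile, &ov, &bytes, TRUE);   // blocks until the break
        // Another process wants the file: release our handle so it can proceed,
        // and remember that the file is (about to be) changed.
        CloseHandle(ov.hEvent);
        CloseHandle(hFile);
    }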

Overwriting a file without the risk of a corrupt file

So often my applications want to save files to load again later. Having recently got unlucky with a crash, I want to write the operation in such a way that I am guaranteed to either have the new data, or the original data, but not a corrupted mess.
My first idea was to do something along the lines of (to save a file called example.dat):
Come up with a unique file name for the target directory, e.g. example.dat.tmp
Create that file and write my data to it.
Delete the original file (example.dat)
Rename ("Move") the temp file to where the original was (example.dat.tmp -> example.dat).
Then at load time the application can follow the following rules:
If no "example.dat" and no "example.dat.tmp", first run / new project, so load in the defaults / create new file.
If "example.dat" and no "example.dat.tmp", then load example.dat (normal load case)
If "example.dat.tmp" exists offer the user the chance to potentially recover data. If "example.dat" also exists, do not overwrite it without explicit user constant.
However, having done a little research, I found that in addition to OS caching, which I may be able to override with the file-flush methods, some disk drives still cache internally and may even lie to the OS, saying they are done. So step 4 could complete, the write might not actually be written, and if the system goes down I have lost my data...
I am not sure the disk problem is actually solvable by an application, but are the general rules above the correct thing to do? Should I keep an old recovery copy of the file for longer to be sure, what are the guidelines regarding such things (e.g. acceptable disk usage, should the user choose, where to put such files, etc.).
Also, how should I avoid potential conflicts with the user and other programs over "example.dat.tmp"? I recall sometimes seeing a "~example.dat" from other software; is that a better convention?
If the disk drives report back to the OS that the data is physically on the disk, and it's not, then there's not much you can do about it. A lot of disks do cache a certain number of writes, and report them done, but such disks should have a battery backup, and finish the physical writes no matter what (and they won't lose data in case of a system crash, since they won't even see it).
For the rest, you say you've done some research, so you no doubt know that you can't use std::ofstream (nor FILE*) for this; you have to do the actual writes at the system level, and open the files with special attributes to ensure full synchronization. Otherwise, the operations can stick around in the OS buffering for a while. And as far as I know, there's no way of ensuring such synchronization for a rename. (But I'm not sure that it's necessary if you always keep two versions: my usual convention in such cases is to write to a file "example.dat.new", then when I'm done writing, delete any file named "example.dat.bak", rename "example.dat" to "example.dat.bak", and then rename "example.dat.new" to "example.dat". Given this, you should be able to figure out what did or did not happen, and find the correct file, interactively if need be, or by inserting an initial line with the timestamp.)
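A minimal sketch of that convention using the Win32 API; error handling is abbreviated, and the lying-drive-cache caveat above still applies:

    #include <windows.h>

    bool SaveAtomically(const void* data, DWORD size) {
        HANDLE h = CreateFileW(L"example.dat.new", GENERIC_WRITE, 0, nullptr,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h == INVALID_HANDLE_VALUE) return false;
        DWORD written = 0;
        BOOL ok = WriteFile(h, data, size, &written, nullptr) && written == size;
        ok = ok && FlushFileBuffers(h);       // push the data out of the OS cache
        CloseHandle(h);
        if (!ok) return false;

        DeleteFileW(L"example.dat.bak");                     // may not exist; fine
        MoveFileExW(L"example.dat", L"example.dat.bak", 0);  // fails on first run; fine
        // The final rename is the commit point.
        return MoveFileExW(L"example.dat.new", L"example.dat",
                           MOVEFILE_WRITE_THROUGH) != FALSE;
    }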
You should lock the actual data file while you write its substitute, if there's a chance that a different process could be going through the same protocol that you are describing.
You can use flock for the file lock.
As for your temp file name, you could make your process ID part of it, for instance "example.dat.3124". No other simultaneously-running process would generate the same name.
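For illustration, a hypothetical way to build such a name:

    #include <windows.h>
    #include <cwchar>

    // Derive a per-process temporary name, as suggested above.
    wchar_t tmpName[MAX_PATH];
    std::swprintf(tmpName, MAX_PATH, L"example.dat.%lu",
                  static_cast<unsigned long>(GetCurrentProcessId()));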

Using temporary files safely

There is a static library I use in my program which can only take filenames as its input, not actual file contents. There is nothing I can do about the library's source code. So I want to: create a brand-new file, store the data to be processed into it, flush it onto the disk(?), pass its name to the library, then delete it.
But I also want this process to be rather secure:
1) the file must be created anew, without any bogus data (maybe it's not critical, but whatever);
2) anyone but my process must not be able to read or write from/to this file (I want the library to process my actual data, not bogus data some wiseguy managed to plug in);
3) after I'm done with this file, it must be deleted (okay, if someone TerminateProcess() me, I guess there is nothing much can be done, but still).
The library seems to use non-Unicode fopen() to open the given file though, so I am not quite sure how to handle all this, since the program is intended to run on Windows. Any suggestions?
You have a lot of suggestions already, but another option that I don't think has been mentioned is using named pipes. It will depend on the library in question as to whether it works or not, but it might be worth a try. You can create a named pipe in your application using the CreateNamedPipe function, and pass the name of the pipe to the library to operate on (the filename you would pass would be \\.\pipe\PipeName). Whether the library accepts a filename like that or not is something you would have to try, but if it works the advantage is your file never has to actually be written to disk.
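A minimal sketch of the pipe side, assuming the hypothetical pipe name \\.\pipe\MyTempData; whether the library's fopen() accepts such a name is exactly the thing to test:

    #include <windows.h>

    // Server end of the pipe; the library opens "\\.\pipe\MyTempData" as a file.
    HANDLE CreateDataPipe() {
        return CreateNamedPipeW(
            L"\\\\.\\pipe\\MyTempData",    // hypothetical pipe name
            PIPE_ACCESS_OUTBOUND,          // we write, the library reads
            PIPE_TYPE_BYTE | PIPE_WAIT,
            1,                             // a single instance
            64 * 1024, 64 * 1024,          // out/in buffer sizes (hints)
            0, nullptr);
    }

    // Typical flow, on a worker thread: wait for the library to open the pipe,
    // stream the data into it, then close our end.
    void ServeData(HANDLE hPipe, const void* data, DWORD size) {
        if (ConnectNamedPipe(hPipe, nullptr) ||
            GetLastError() == ERROR_PIPE_CONNECTED) {
            DWORD written = 0;
            WriteFile(hPipe, data, size, &written, nullptr);
        }
        CloseHandle(hPipe);
    }

One caveat: pipes are not seekable, so if the library calls fseek() on the "file", this approach will fail.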
This can be achieved using the CreateFile and GetTempFileName functions (if you don't know whether you can write to the current working directory, you may also want to use GetTempPath).
Determine a directory to store your temporary file in; the current directory (".") or the result of GetTempPath would be good candidates.
Use GetTempFileName to create a temporary file name.
Finally, call CreateFile to create the temporary file.
For the last step, there are a few things to consider:
The dwFlagsAndAttributes parameter of CreateFile should probably include FILE_ATTRIBUTE_TEMPORARY.
The dwFlagsAndAttributes parameter should probably also include FILE_FLAG_DELETE_ON_CLOSE to make sure that the file gets deleted no matter what (this probably also works if your process crashes, in which case the system closes all handles for you).
The dwShareMode parameter of CreateFile should probably be FILE_SHARE_READ so that other attempts to open the file will succeed, but only for reading. This means that your library code will be able to read the file, but nobody will be able to write to it.
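Putting those steps together, a minimal sketch; note that GetTempFileName already creates an empty file, which is then reopened with the suggested flags:

    #include <windows.h>

    HANDLE CreateSecureTempFile(wchar_t (&path)[MAX_PATH]) {
        wchar_t dir[MAX_PATH];
        if (!GetTempPathW(MAX_PATH, dir))                     // step 1
            return INVALID_HANDLE_VALUE;
        if (!GetTempFileNameW(dir, L"tmp", 0, path))          // step 2
            return INVALID_HANDLE_VALUE;
        // Reopen the file with the attributes suggested above (step 3).
        return CreateFileW(path, GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ,          // others may read, not write
                           nullptr, CREATE_ALWAYS,
                           FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
                           nullptr);
    }

One caveat: because FILE_FLAG_DELETE_ON_CLOSE makes our handle hold DELETE access, a subsequent open must include FILE_SHARE_DELETE in its share mode; a plain fopen() does not, so test whether the library can actually open the file while our handle is alive.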
This article should give you some good guidelines on the issue.
The gist of the matter is this:
The POSIX mkstemp() function is the secure and preferred solution where available. Unfortunately, it is not available in Windows, so you would need to find a wrapper that properly implements this functionality using Windows API calls.
On Windows, the tmpfile_s() function is the only one that actually opens the temporary file atomically (instead of simply generating a filename), protecting you from a race condition. Unfortunately, this function does not allow you to specify which directory the file will be created in, which is a potential security issue.
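For completeness, a minimal sketch of tmpfile_s as provided by the MSVC CRT:

    #include <stdio.h>

    void UseScratchFile() {
        FILE* fp = nullptr;
        if (tmpfile_s(&fp) == 0 && fp) {   // created and opened in one step
            fputs("scratch data", fp);
            fclose(fp);   // the temporary file is deleted automatically here
        }
    }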
Primarily, you can create the file in the user's temporary folder (e.g. C:\Users\<username>\AppData\Local\Temp); it is a perfect place for such files. Secondly, when creating a file, you can specify what kind of access sharing you provide.
Fragment of CreateFile help page on MSDN:
dwShareMode

0
Prevents other processes from opening a file or device if they request delete, read, or write access.

FILE_SHARE_DELETE
Enables subsequent open operations on a file or device to request delete access. Otherwise, other processes cannot open the file or device if they request delete access. If this flag is not specified, but the file or device has been opened for delete access, the function fails. Note: delete access allows both delete and rename operations.

FILE_SHARE_READ
Enables subsequent open operations on a file or device to request read access. Otherwise, other processes cannot open the file or device if they request read access. If this flag is not specified, but the file or device has been opened for read access, the function fails.

FILE_SHARE_WRITE
Enables subsequent open operations on a file or device to request write access. Otherwise, other processes cannot open the file or device if they request write access. If this flag is not specified, but the file or device has been opened for write access or has a file mapping with write access, the function fails.
While the suggestions given are good, such as using FILE_SHARE_READ, FILE_FLAG_DELETE_ON_CLOSE, etc., I don't think there is a completely safe way to do this.
I have used Process Explorer to close files that were meant to prevent a second process from starting. I did this because the first process got stuck ("not killable and not dead, but not responding"), so I had a valid reason to do it, and I didn't want to reboot the machine at that particular point due to other processes running on the system.
If someone uses a debugger of some sort [including something non-commercial, written specifically for this purpose], attaches to your running process, sets a breakpoint and stops the code, then closes the file you have open, it can write to the file you just created.
You can make it harder, but you can't stop someone with sufficient privileges/skills/capabilities from intercepting your program and manipulating the data.
Note that file/folder protection only works if you reliably know that users don't have privileged accounts on the machine. Typical Windows users are either admins right away or have another account for admin purposes, and I have access to sudo/root on nearly all of the Linux boxes I use at work. There are some fileservers where I don't [and shouldn't] have root access, but on all the boxes I use myself or can borrow for testing purposes, I can get to a root environment. This is not very unusual.
A solution I can think of is to find a different library that uses a different interface [or get the sources of the library and modify it so that it does]. Not that this prevents a "stop, modify and go" attack using the debugger approach described above.
Create your file in your executable's folder using the CreateFile API. You can give the file a UUID-based name each time it is created, so that no other process can guess the file name in order to open it, and set its attribute to hidden. After using it, just delete the file. Is that enough?

Ensuring a file is flushed when file created in external process (Win32)

Windows Win32 C++ question about flushing file activity to disk.
I have an external application (ran using CreateProcess) which does some file creation. i.e., when it returns it will have created a file with some content.
How can I ensure that the file the process created was really flushed to disk, before I proceed?
By this I mean not the C++ buffers but really flushing disk (e.g. FlushFileBuffers).
Remember that I don't have access to any file HANDLE - this is all of course hidden inside the external process.
I guess I could open up a handle of my own to the file and then use FlushFileBuffers, but it's not clear this would work (since my handle doesn't actually contain anything which needs flushing).
Finally, I want this to run in non-admin userspace so I cannot use FlushFileBuffers on a whole volume.
Any ideas?
UPDATE: Why do I think this is a problem?
I'm working on a data backup application. Essentially it has to create some files as described. It then has to update its internal DB (using an embedded SQLite DB).
I recently had a data corruption issue which occurred during a bluescreen (the cause of which was unrelated to my app).
What I'm concerned about is application integrity during a system crash. And yes, I do care about this because this app is a data backup app.
The use case I'm concerned about is this:
1. A small data file is created using the external process. This write is waiting in the OS cache to be written to disk.
2. I update the DB and commit. This is a disk activity. This write is also waiting in the OS cache.
3. A system failure occurs.
As I see it, we're now in a potential race condition. If "1" gets flushed and "2" doesn't, then we're fine (as the DB transaction wasn't committed). If neither gets flushed, or both get flushed, then we're also OK.
As I understand it, the writes will be non-deterministic. i.e., I'm not aware that the OS will guarantee to write "1" before "2". (Am I wrong?)
So, if "2" gets flushed, but "1" doesn't then we have a problem.
What I observed was that the DB was correctly updated, but the file had garbage in it: the last two-thirds of the data was binary zeroes. Now, I don't know what it looks like when a file is partly flushed at the time of a bluescreen, but I wouldn't be surprised if it looked like that.
Can I guarantee this is the cause? No I cannot guarantee this. I'm just speculating. It could just be that the file was "naturally" corrupted due to disk failure or as a result of the blue screen.
With regards to performance, this is something I believe I can deal with.
For example, the default behaviour of SQLite is to do a full file flush (using FlushFileBuffers) every time you commit a transaction. They are quite clear that if you don't do this then at the time of system crash, you might have a corrupted DB.
Also, I believe I can mitigate the performance hit by only flushing at "checkpoints". For example, writing 50 files, flushing the lot and then writing to the DB.
How likely is all this to be a problem? Beats me. But my app might well be archiving at or around the time of a system failure, so it might be more likely than you think.
Hope that explains why I want to do this.
Why would you want this? The OS will make sure that the data is flushed to the disk in due time. If you access it, it will either return the data from the cache or from disk, so this is transparent for you.
If you need some safety in case of disaster, then you must call FlushFileBuffers, for example by creating a process with admin rights after running the external process. But that can severely impact the performance of the whole machine.
Your only other option is to modify the source of the other process.
[EDIT] The simplest solution is probably to copy the file in your process and then flush the copy (since you have the handle). Save the copy under a name that says "not committed in the database".
Then update the database. Write into the database, "updated from file ...". If this entry already exists next time, don't update the database and skip this step.
Flush the database to disk.
Rename the file to "file has been processed into database". Rename is an atomic operation (so it either happens or not).
If you can't think of a good filename for the different states, then use subfolders and move the file between them.
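A minimal sketch of that flow; the file names are hypothetical, and the database step is elided:

    #include <windows.h>

    bool CommitExternalFile(const wchar_t* producedByExternalProcess) {
        // Copy the external process's output so that we own a handle to the copy.
        if (!CopyFileW(producedByExternalProcess, L"backup.uncommitted", FALSE))
            return false;

        HANDLE h = CreateFileW(L"backup.uncommitted", GENERIC_WRITE, 0, nullptr,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h == INVALID_HANDLE_VALUE) return false;
        BOOL flushed = FlushFileBuffers(h);   // now we *can* flush: it's our handle
        CloseHandle(h);
        if (!flushed) return false;

        // ... update the database here, recording "updated from file ...",
        // and flush the database to disk ...

        // The rename is atomic: it marks the file as fully processed.
        return MoveFileExW(L"backup.uncommitted", L"backup.committed",
                           MOVEFILE_WRITE_THROUGH) != FALSE;
    }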
Well, there are no attractive options here. There is no documented way to retrieve the file handle you need from the process. Although there are undocumented ones, go there (via DuplicateHandle) only with careful consideration.
Yes, calling FlushFileBuffers on a volume handle is the documented way. You can avoid the privilege problem by letting a service make the call. Talk to it from your app with one of the standard process interop mechanisms. A named pipe whose name is prefixed with Global\ is probably the easiest way to get that going.
After your update I think http://sqlite.org/atomiccommit.html gives you the answers you need.
The way SQLite ensures that everything is flushed to disk works, so it will work for you as well; take a look at the source.