How to determine when files are done copying for further processing? - c++

Alright, so to start: this is strictly for Windows, and I'd prefer to use C++ over .NET. I'm not opposed to boost::filesystem, although if it can be avoided in favor of the straight Windows API, I'd prefer that.
The scenario: an application on another machine, which I can't change, creates files in a particular directory on my machine, and I need to back those files up and do some extra processing on them. So far I've made a little application that sits and listens for change notifications in the target directory using the FindFirstChangeNotification and FindNextChangeNotification Windows APIs.
The problem is that while I can get notified when files in the directory are created, modified, change size, etc., it only notifies once and doesn't tell me specifically which files were affected. I've looked at ReadDirectoryChangesW as well, but it's the same story there, except that I can get slightly more specific information.
I can scan the directory and try to acquire locks or open the files to determine what changed since the last notification and whether the files are available for further use. But when a large file is being copied, I've found this isn't good enough: the file won't be ready to be manipulated, I won't get any further notifications after the first one, and so there's no way to tell when the copy has actually finished unless, after the first notification, I keep trying to acquire a lock until it succeeds.
The only other thing I can think of that would be less hackish is some kind of end-token file, but since I don't have control over the application creating the files in the first place, I don't see how I'd go about doing that, and it's still not ideal.
Any suggestions?

This is a fairly common problem and one that doesn't have an easy answer. Acquiring locks is one of the best options when you cannot change the thing at the remote end. Another approach I have seen is to watch the file at intervals until the size doesn't change for an interval or two.
Other strategies include writing a zero-byte file as a trigger when the main file is complete, or writing to a temp directory and then moving the complete file to the real destination. But to be reliable, it must be the sender who controls this. As the receiver, you are constrained to watching the directory and waiting for the file to settle.
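As a rough illustration of the size-settling approach, here is a minimal sketch; the function name, poll interval, and "two quiet polls" threshold are all illustrative, and you would already know the path from a change notification:

    #include <windows.h>

    // Returns true once the file's size has stayed the same for two consecutive polls.
    bool WaitForFileToSettle(const wchar_t* path, DWORD pollMs = 1000)
    {
        LONGLONG lastSize = -1;
        int stableCount = 0;
        while (stableCount < 2)
        {
            WIN32_FILE_ATTRIBUTE_DATA fad;
            if (!GetFileAttributesExW(path, GetFileExInfoStandard, &fad))
                return false;   // file vanished or is inaccessible
            LONGLONG size = (static_cast<LONGLONG>(fad.nFileSizeHigh) << 32) | fad.nFileSizeLow;
            stableCount = (size == lastSize) ? stableCount + 1 : 0;
            lastSize = size;
            Sleep(pollMs);
        }
        return true;
    }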

It looks like ReadDirectoryChangesW is going to be your best bet. For each file copy operation, you should receive FILE_ACTION_ADDED followed by a bunch of FILE_ACTION_MODIFIED notifications. On the last FILE_ACTION_MODIFIED notification, the file should no longer be locked by the copying process. So if you try to acquire a lock after each FILE_ACTION_MODIFIED, the attempt should fail until the copy completes. It's not a particularly elegant solution, but there doesn't seem to be any notification available for when a file copy completes.
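As an illustration of that lock test, here is a minimal sketch; the function name is made up, and the path would come from the FILE_NOTIFY_INFORMATION record of the notification:

    #include <windows.h>

    // Try to open the file with no sharing. While the copying process still has it
    // open, this fails (typically with ERROR_SHARING_VIOLATION); once the copy is
    // finished, the exclusive open succeeds and the file is safe to process.
    bool IsFileReady(const wchar_t* path)
    {
        HANDLE h = CreateFileW(path,
                               GENERIC_READ,
                               0,                    // no sharing
                               nullptr,
                               OPEN_EXISTING,
                               FILE_ATTRIBUTE_NORMAL,
                               nullptr);
        if (h == INVALID_HANDLE_VALUE)
            return false;
        CloseHandle(h);
        return true;
    }

You would call something like this after each FILE_ACTION_MODIFIED notification for the file and only start processing once it returns true.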

You can process the data once the file is closed, right? So the task is to track when the file is closed. This can be done using a file system filter driver. You can write your own, or you can use our CallbackFilter product.

Related

Robust way to detect if file has changed

I think this question hasn't been answered for my use-case.
We wish to detect if the user has changed a file without re-reading its contents for the purposes of caching a computation result based on the file contents. Our program is a long-running one that lets the user click a button to perform a computation based on data entered in the program and data stored in external files (sorry, I can't be more specific than that). The external data needs to be read, processed and various data structures need to be built based on it, so we try to cache those between computations to speed up re-computes when the user changes the data in the program itself, but not the data in the external files. However, if the external file has changed, we have to re-read that.
For each external resource we're checking whether the modification time and file size have changed, but that's not really all that robust and can lead to user frustration. For example, if they have fileA and fileB with the same size and timestamp, copy fileA to fileC, use fileC as an external resource, and then copy fileB to fileC, the system preserves the modification time of the original file and the sizes are the same, so we don't re-read the external resource.
Our program runs on Windows, macOS and Linux, is written in C++ and we're perfectly OK with using platform-specific code to detect file changes. We're interested in the most robust way to detect if the contents of a file identified by a file path have changed without actually reading the file itself.
I've made this answer a community wiki so others can add their ideas for the various platforms listed in the question.
Linux
macOS
Windows
Option 1
Set up a thread that watches the directory containing the file. When the directory changes, you'll have to check if the file you care about has actually changed. That may mean opening and re-reading the file, (e.g., to compute the current checksum). But since you have to do this only after a change notification, this overhead may be acceptable.
I believe (but have not verified) that if someone copies a same-size, same-timestamp file over an existing file, you'll get a directory change notification.
Option 2
Hold the file open with an opportunistic lock. This involves creating the lock with a call to DeviceIoControl and then issuing a blocking call to GetOverlappedResult, which will unblock when another process attempts to change the file. Your program can then release the lock, allowing the other process to update the file, and it knows that the file is being changed.
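As a very rough sketch of that sequence (names and error handling are illustrative, and FSCTL_REQUEST_FILTER_OPLOCK is just one of the oplock request codes you might use):

    #include <windows.h>
    #include <winioctl.h>

    // Blocks until another process tries to open or modify the file.
    void WaitForFileChange(const wchar_t* path)
    {
        HANDLE h = CreateFileW(path, GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                               nullptr, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
        if (h == INVALID_HANDLE_VALUE) return;

        OVERLAPPED ov = {};
        ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);

        // The request is asynchronous; the pending I/O completes when the oplock
        // breaks, i.e. when some other process touches the file.
        DeviceIoControl(h, FSCTL_REQUEST_FILTER_OPLOCK,
                        nullptr, 0, nullptr, 0, nullptr, &ov);

        DWORD bytes = 0;
        GetOverlappedResult(h, &ov, &bytes, TRUE);   // blocks until the oplock breaks

        CloseHandle(ov.hEvent);
        CloseHandle(h);   // releasing the handle lets the writer proceed
    }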

How to recursively monitor a directory besides using inotify

I want to make a C++ application that will monitor file changes in its directory or its sub-directories. I know inotify can monitor only a single directory level, and that I need to manually add a watch for each sub-directory to detect changes within it.
I need to know if there is any other way to recursively monitor changes in a directory in Linux other than inotify.
To monitor a directory recursively you have to:
Create an inotify(7) object with inotify_init(2) or inotify_init2(2).
Descend recursively into your directory, using inotify_add_watch(2) for all the nodes you want to be notified about (add the watch for the directory itself before scanning it, or you'll lose events; see below).
Wait for events to come on the inotify descriptor you have received.
Take into account that directory creation may force you to rescan the new directory's contents: you can get an event for a new subdirectory of a watched directory, but files may be created in it before you've had a chance to add a watch for it, so you will not be informed of those files' creation.
For this reason, consider putting the watch-and-scan logic in a subroutine and calling it each time a new directory is created. You may end up scanning files for which you also receive creation events, but otherwise you can lose events.
Also, you must prepare yourself to do a complete tree rescan in case you lose events (e.g., on a queue overflow).
And believe me, this is far more efficient than doing it the classic way, and you get around the problem of short-lived files that appear and vanish between rescans.
The reason there's no recursive option for this has been pointed out above in a comment: you'd need the kernel to do the whole search for you, even for parts of the tree you are not interested in.
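Here is a rough sketch of those steps, assuming Linux and C++17 <filesystem> for the recursive descent; the watched path and event mask are placeholders and error handling is minimal:

    #include <sys/inotify.h>
    #include <unistd.h>
    #include <cstdio>
    #include <filesystem>
    #include <map>

    namespace fs = std::filesystem;

    // Add a watch for `dir` itself and for every subdirectory beneath it.
    static void watch_tree(int fd, const fs::path& dir, std::map<int, fs::path>& watches)
    {
        const unsigned int mask = IN_CREATE | IN_CLOSE_WRITE | IN_DELETE | IN_MOVED_TO;
        int wd = inotify_add_watch(fd, dir.c_str(), mask);
        if (wd >= 0) watches[wd] = dir;
        std::error_code ec;
        for (const auto& entry : fs::recursive_directory_iterator(dir, ec))
            if (entry.is_directory()) {
                int w = inotify_add_watch(fd, entry.path().c_str(), mask);
                if (w >= 0) watches[w] = entry.path();
            }
    }

    int main()
    {
        int fd = inotify_init();
        if (fd < 0) return 1;

        std::map<int, fs::path> watches;           // wd -> directory it watches
        watch_tree(fd, "/tmp/watched", watches);   // placeholder path

        alignas(inotify_event) char buf[4096];
        for (;;) {
            ssize_t len = read(fd, buf, sizeof buf);
            if (len <= 0) break;
            for (ssize_t i = 0; i < len; ) {
                auto* ev = reinterpret_cast<inotify_event*>(buf + i);
                fs::path full = watches[ev->wd] / (ev->len ? ev->name : "");
                // A new subdirectory must be watched immediately, and rescanned for
                // files created before the watch existed (see the caveat above).
                if ((ev->mask & IN_CREATE) && (ev->mask & IN_ISDIR))
                    watch_tree(fd, full, watches);
                std::printf("event 0x%x on %s\n", ev->mask, full.c_str());
                i += sizeof(inotify_event) + ev->len;
            }
        }
        close(fd);
    }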
Yes, there is the old classic way: simply check the directory (recursively) at intervals. Store a list of all files and directories along with their modification dates. Then it's easy to see if there is a new file or directory, if one has been removed, or if one has otherwise been modified.
It is time-consuming, though, and you need to store data about every file and directory, so if you try to do it on the root directory you will use a lot of memory or storage.
Probably you want to mix inotify and dnotify to achieve that.

How to check if a file is still being written?

How can I check if a file is still being written? I need to wait for a file to be created, written and closed again by another process, so I can go on and open it again in my process.
In general, this is a difficult problem to solve. You can ask whether a file is open, under certain circumstances; however, if the other process is a script, it might well open and close the file multiple times. I would strongly recommend you use an advisory lock, or some other explicit method for the other process to communicate when it's done with the file.
That said, if that's not an option, there is another way. If you look in the /proc/<pid>/fd directories, where <pid> is the numeric process ID of some running process, you'll see a bunch of symlinks to the files that process has open. The permissions on the symlink reflect the mode the file was opened for - write permission means it was opened for write mode.
So, if you want to know if a file is open, just scan over every process's /proc entry, and every file descriptor in it, looking for a writable symlink to your file. If you know the PID of the other process, you can directly look at its proc entry, as well.
This has some major downsides, of course. First, you can only see open files for your own processes, unless you're root. It's also relatively slow, and only works on Linux. And again, if the other process opens and closes the file several times, you're stuck - you might end up seeing it during the closed period, and there's no easy way of knowing if it'll open it again.
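Here is a rough, Linux-only sketch of that scan using C++17 <filesystem>; the function name is illustrative and error handling is minimal:

    #include <filesystem>
    #include <system_error>

    namespace fs = std::filesystem;

    // Returns true if any process we are allowed to inspect has `target` open for writing.
    bool IsFileOpenForWrite(const fs::path& target)
    {
        std::error_code ec;
        const fs::path resolved = fs::canonical(target, ec);
        if (ec) return false;

        for (const auto& proc : fs::directory_iterator("/proc", ec)) {
            std::error_code inner;
            for (const auto& fd : fs::directory_iterator(proc.path() / "fd", inner)) {
                // Each entry is a symlink to an open file; its permission bits
                // reflect the mode the file was opened with.
                const fs::path link = fs::read_symlink(fd.path(), inner);
                if (inner || link != resolved) continue;
                const auto perms = fs::symlink_status(fd.path(), inner).permissions();
                if ((perms & fs::perms::owner_write) != fs::perms::none)
                    return true;
            }
        }
        return false;
    }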
You could let the writing process write a sentinel file (say "sentinel.ok") after it is finished writing the data file your reading process is interested in. In the reading process you can check for the existence of the sentinel before reading the data file, to ensure that the data file is completely written.
#blu3bird's idea of using a sentinel file isn't bad, but it requires modifying the program that's writing the file.
Here's another possibility that also requires modifying the writer, but it may be more robust:
Write to a temporary file, say "foo.dat.part". When writing is complete, rename "foo.dat.part" to "foo.dat". That way a reader either won't see "foo.dat" at all, or will see a complete version of it.
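A minimal sketch of that writer-side pattern, assuming C++17 <filesystem> (the names are just examples):

    #include <filesystem>
    #include <fstream>
    #include <string>

    // Write the data to "<dest>.part", then rename it into place.
    void WriteAtomically(const std::filesystem::path& dest, const std::string& data)
    {
        std::filesystem::path temp = dest;
        temp += ".part";                      // e.g. "foo.dat.part"
        {
            std::ofstream out(temp, std::ios::binary);
            out << data;
        }                                     // stream flushed and closed here
        std::filesystem::rename(temp, dest);  // readers see either the old file or the complete new one
    }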
You can try using inotify
http://en.wikipedia.org/wiki/Inotify
If you know that the file will be opened once, written and then closed, it would be possible for your app to wait for the IN_CLOSE_WRITE event.
However, if the behaviour of the other application doing the writing is more like open, write, close, open, write, close, ..., then you'll need some other mechanism to determine when the other app has truly finished with the file.
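If the single open-write-close assumption holds, a small sketch of waiting for IN_CLOSE_WRITE might look like this (Linux only; the function name is illustrative):

    #include <sys/inotify.h>
    #include <unistd.h>
    #include <string>

    // Blocks until a file named `filename` inside `dir` is closed after being written.
    bool WaitForCloseWrite(const char* dir, const std::string& filename)
    {
        int fd = inotify_init();
        if (fd < 0) return false;
        int wd = inotify_add_watch(fd, dir, IN_CLOSE_WRITE);
        if (wd < 0) { close(fd); return false; }

        bool done = false;
        alignas(inotify_event) char buf[4096];
        while (!done) {
            ssize_t len = read(fd, buf, sizeof buf);
            if (len <= 0) break;
            for (ssize_t i = 0; i < len && !done; ) {
                auto* ev = reinterpret_cast<inotify_event*>(buf + i);
                if (ev->len && filename == ev->name)
                    done = true;              // the writer closed our file after writing
                i += sizeof(inotify_event) + ev->len;
            }
        }
        inotify_rm_watch(fd, wd);
        close(fd);
        return done;
    }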

Win32 C++ ReadDirectoryChangesW "creation" and "modification" of file difference detect?

Here is the problem: I monitor a directory using the Win32 API ReadDirectoryChangesW function, and I need to distinguish between newly created files and modified files. But there are problems... as always :(
Cases:
I monitor the directory for new/modified files (FILE_NOTIFY_CHANGE_FILE_NAME | FILE_NOTIFY_CHANGE_SIZE). Problem: after a file is created, both a new-file event and a modify event are triggered, but I need only one. How can I avoid that? When a file is merely modified, I get what I want :).
I monitor the directory only for new files (FILE_NOTIFY_CHANGE_FILE_NAME) - NO PROBLEM.
I monitor the directory only for modified files (FILE_NOTIFY_CHANGE_SIZE). Problem: when a new file is created, a modify action is fired along with the file-creation event. How can I avoid that?
Of course, I've implemented some workarounds, but I want to know if there is any elegant way of handling the problems I described.
You should be catching FILE_NOTIFY_CHANGE_LAST_WRITE, not FILE_NOTIFY_CHANGE_SIZE, for a modified file. Files may be modified without the size changing.
You should also keep a queue of the changes and the times they happened, and only process the queue after there have been no changes for the past 1-2 seconds. Some applications do very strange things when creating or modifying files, and you'll most likely want to special-case popular applications if you plan on using this code in the wild.
ReadDirectoryChanges isn't one of the friendliest winapi functions. You probably can't get around receiving two events on file creation; I'm not completely sure whether you'll get an extra modify for FILE_NOTIFY_CHANGE_LAST_WRITE on creation, but I think you probably will. Using the queue approach will allow you to easily throw out the extra event if it has the same time stamp as the creation event.
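As an illustration of the queue-and-settle idea, here is a hedged sketch; the class and member names are made up, and you would feed it from your ReadDirectoryChangesW handler and poll it on a timer:

    #include <chrono>
    #include <map>
    #include <string>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    // Records change events with a timestamp and hands a file off only once it
    // has been quiet for the configured debounce window.
    class ChangeDebouncer
    {
    public:
        explicit ChangeDebouncer(std::chrono::milliseconds quiet) : quiet_(quiet) {}

        // Called for every ReadDirectoryChangesW notification.
        void OnChange(const std::wstring& path) { pending_[path] = Clock::now(); }

        // Called periodically; returns the files that have been quiet long enough.
        std::vector<std::wstring> TakeSettled()
        {
            std::vector<std::wstring> ready;
            const auto now = Clock::now();
            for (auto it = pending_.begin(); it != pending_.end(); ) {
                if (now - it->second >= quiet_) {
                    ready.push_back(it->first);
                    it = pending_.erase(it);
                } else {
                    ++it;
                }
            }
            return ready;
        }

    private:
        std::chrono::milliseconds quiet_;
        std::map<std::wstring, Clock::time_point> pending_;
    };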

File corruption detection and error handling

I'm a newbie C++ developer and I'm working on an application which needs to write out a log file every so often, and we've noticed that the log file has been corrupted a few times when running the app. The main scenarios seem to be when the program is shutting down or crashes, but I'm concerned that this isn't the only time something may go wrong, as the application was born out of a fairly "quick and dirty" project.
It's not critical to have the most up-to-date data saved, so one idea someone mentioned was to alternate writing between two log files; then if the program crashes, at least one will still have proper integrity. But this doesn't smell right to me, as I haven't really seen any other application use this method.
Are there any "best practices" or standard "patterns" or frameworks to deal with this problem?
At the moment I'm thinking of doing something like this -
Write data to a temp file
Check the data was written correctly with a hash
Rename the original file, and put the temp file in place.
Delete the original
Then if anything fails I can just roll back by just deleting the temp, and the original be untouched.
You must find the reason why the file gets corrupted. If the app crashes unexpectedly, it can't corrupt the file. The only thing that can happen is that the file is truncated (i.e. the last log messages are missing). The app can't really jump around in the file and modify something elsewhere (unless you call seek in the logging code, which would surprise me).
My guess is that the app is multi-threaded and the logging code is being called from several threads, which can easily lead to the data being corrupted before it is written to the log.
You probably forgot to call fsync() every so often, or the data comes in from different threads without proper synchronization among them. Hard to tell without more information (platform, form of corruption you see).
A workaround would be to use log file rollover, i.e. starting a new file every so often.
I really think that you (and others) are wasting your time when you start adding complexity to log files. The whole point of a log is that it should be simple to use and implement, and should work most of the time. To that end, just write the log to an unbuffered stream (like cerr in a C++ program) and live with the very occasional, in my experience, snafus.
OTOH, if you really need an audit trail of everything your app does, for legal reasons, then you should be using some form of transactional storage such as a SQL database.
Not sure if your app is multi-threaded -- if so, consider using the Active Object Pattern (PDF) to put a queue in front of the log and make all writes from a single thread. That thread can commit the log in the background. All log writes will be asynchronous and in order, but not necessarily written immediately.
The active object can also batch writes.
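As a rough sketch of that idea (not a full Active Object implementation; the class and member names are illustrative), a background-thread logger could look like this:

    #include <condition_variable>
    #include <fstream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // All log writes are funneled through one background thread that owns the stream.
    class AsyncLogger
    {
    public:
        explicit AsyncLogger(const std::string& path)
            : out_(path, std::ios::app), worker_([this] { Run(); }) {}

        ~AsyncLogger()
        {
            {
                std::lock_guard<std::mutex> lk(m_);
                done_ = true;
            }
            cv_.notify_one();
            worker_.join();   // drain the queue before shutting down
        }

        // Thread-safe: callers only enqueue; the worker thread does the I/O.
        void Log(std::string line)
        {
            {
                std::lock_guard<std::mutex> lk(m_);
                queue_.push(std::move(line));
            }
            cv_.notify_one();
        }

    private:
        void Run()
        {
            std::unique_lock<std::mutex> lk(m_);
            while (!done_ || !queue_.empty())
            {
                cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                while (!queue_.empty())
                {
                    std::string line = std::move(queue_.front());
                    queue_.pop();
                    lk.unlock();          // write outside the lock
                    out_ << line << '\n';
                    out_.flush();
                    lk.lock();
                }
            }
        }

        std::ofstream out_;
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::string> queue_;
        bool done_ = false;
        std::thread worker_;   // declared last so everything else is ready before it starts
    };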