how to process a file exactly once

My Java application reads data from a directory and puts it into a shared queue. The consumer takes each event, validates it, and saves it to a database. I want to process each file in the directory only once: even if the application is restarted, it should not process a file again, but should resume from the file where it stopped. Can anyone help me out with this?

Do you have any code that we could look at? That way we can see exactly what you need.
Given what you have, you might want to look into how the program can "save progress", for example by recording which files it has already finished. You might be able to do something there.
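One way to "save progress" could look like the sketch below. It is only an illustration, not the asker's code: `ProcessedLedger`, its methods, and the idea of a plain-text ledger file are all assumptions. The ledger records the name of every finished file, and after a restart the application asks it which files are still pending.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: remembers across restarts which files were already processed. */
final class ProcessedLedger {
    private final Path ledgerFile;
    private final Set<String> done = new HashSet<>();

    /** Loads previously recorded file names, if the ledger already exists. */
    ProcessedLedger(Path ledgerFile) throws IOException {
        this.ledgerFile = ledgerFile;
        if (Files.exists(ledgerFile)) {
            done.addAll(Files.readAllLines(ledgerFile, StandardCharsets.UTF_8));
        }
    }

    /** Returns the regular files in dir that are not yet recorded, sorted by path. */
    List<Path> pending(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.filter(Files::isRegularFile)
                        .filter(p -> !done.contains(p.getFileName().toString()))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /** Appends the file name to the ledger; call this only after the database save succeeded. */
    void markDone(Path file) throws IOException {
        String name = file.getFileName().toString();
        if (done.add(name)) {
            byte[] line = (name + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
            Files.write(ledgerFile, line,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND, StandardOpenOption.SYNC);
        }
    }
}
```

Keep the ledger file outside the watched directory, otherwise it shows up as a pending file itself. Note that a crash between the database save and `markDone` would cause that one file to be processed again, so this gives at-least-once behaviour; recording the file name in the same database transaction as the data would be one way to close that gap.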

Related

Where in the process is a file open?

I have an application suite that I maintain for Windows platforms. I recently added some code to a shared library to remove a directory after the app is done with it. In one app, the deletion is successful; in the other, I receive a message telling me the file is in use by another process.
After downloading Process Explorer, I learned what I had already expected, that the process holding the folder is the one trying to delete it.
When I google for an answer, all I see is, "You need to download XYZ to find out what process is holding the file, then close that process," where "XYZ" is Unlocker, Process Explorer, etc. I know the process that is holding the file, but if I terminate it, how can it delete the folder?
Does anyone have any idea about how to locate the code that is holding the folder open? Of the tools that are available for finding which processes are using which files, can any be used to find where in the process the folder is open?

There is no concept of a "location in the process" where a file is open. For example, a very common cause of unintentionally open files is a leaked handle. In that case the file is open precisely because there is no longer any location in the process that holds the file handle.

How to check if a file is still being written?

How can I check if a file is still being written? I need to wait for a file to be created, written and closed again by another process, so I can go on and open it again in my process.

In general, this is a difficult problem to solve. You can ask whether a file is open, under certain circumstances; however, if the other process is a script, it might well open and close the file multiple times. I would strongly recommend you use an advisory lock, or some other explicit method for the other process to communicate when it's done with the file.
That said, if that's not an option, there is another way. If you look in the /proc/<pid>/fd directories, where <pid> is the numeric process ID of some running process, you'll see a bunch of symlinks to the files that process has open. The permissions on the symlink reflect the mode the file was opened for - write permission means it was opened for write mode.
So, if you want to know if a file is open, just scan over every process's /proc entry, and every file descriptor in it, looking for a writable symlink to your file. If you know the PID of the other process, you can directly look at its proc entry, as well.
This has some major downsides, of course. First, you can only see open files for your own processes, unless you're root. It's also relatively slow, and only works on Linux. And again, if the other process opens and closes the file several times, you're stuck - you might end up seeing it during the closed period, and there's no easy way of knowing if it'll open it again.
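The /proc scan described above might look roughly like this in Java. This is a Linux-only sketch under the same limitations; the class and method names are made up for illustration, and it checks a single process (pass "self" or a numeric PID) rather than every process.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.util.Set;

/** Hypothetical Linux-only helper that inspects /proc/<pid>/fd. */
final class ProcFdScan {
    /** Returns true if the given process currently has the file open with write access. */
    static boolean hasOpenForWrite(String pid, Path file) throws IOException {
        Path target = file.toRealPath();
        Path fdDir = Paths.get("/proc", pid, "fd");
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(fdDir)) {
            for (Path fd : fds) {
                try {
                    // Each entry is a symlink to whatever the descriptor refers to.
                    if (!Files.readSymbolicLink(fd).equals(target)) {
                        continue;
                    }
                    // The symlink's own permissions mirror the mode the file was opened with.
                    Set<PosixFilePermission> perms =
                            Files.getPosixFilePermissions(fd, LinkOption.NOFOLLOW_LINKS);
                    if (perms.contains(PosixFilePermission.OWNER_WRITE)) {
                        return true;
                    }
                } catch (IOException e) {
                    // The descriptor was closed while we were scanning; skip it.
                }
            }
        }
        return false;
    }
}
```

A `false` result only means the file was not open for writing at the moment of the scan; as noted above, the writer may still open it again later.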

You could let the writing process write a sentinel file (say "sentinel.ok") after it is finished writing the data file your reading process is interested in. In the reading process you can check for the existence of the sentinel before reading the data file, to ensure that the data file is completely written.
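On the reading side, the sentinel check could be sketched like this in Java. The per-file naming scheme (`<data file name>.ok`) and the class name are assumptions made for the example; the answer above only suggests a name such as "sentinel.ok".

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical reader-side check for a "finished writing" marker file. */
final class SentinelCheck {
    /** Assumed convention: the sentinel for foo.dat is foo.dat.ok in the same directory. */
    static Path sentinelFor(Path dataFile) {
        return dataFile.resolveSibling(dataFile.getFileName() + ".ok");
    }

    /** The data file is considered complete only once its sentinel exists. */
    static boolean isReady(Path dataFile) {
        return Files.exists(dataFile) && Files.exists(sentinelFor(dataFile));
    }
}
```

The writer has to create the sentinel strictly after it has closed the data file, otherwise the check proves nothing.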

@blu3bird's idea of using a sentinel file isn't bad, but it requires modifying the program that's writing the file.
Here's another possibility that also requires modifying the writer, but it may be more robust:
Write to a temporary file, say "foo.dat.part". When writing is complete, rename "foo.dat.part" to "foo.dat". That way a reader either won't see "foo.dat" at all, or will see a complete version of it.
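If the writer happens to be a Java program, the write-then-rename approach could be sketched as follows. The class and method names are illustrative only, and the sketch assumes the ".part" file and the final file live in the same directory (and therefore on the same file system), which is what makes an atomic rename possible.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical writer-side helper: readers never see a half-written target file. */
final class AtomicPublish {
    /** Writes data to "<target>.part" first, then renames it to the final name. */
    static Path publish(Path target, byte[] data) throws IOException {
        Path part = target.resolveSibling(target.getFileName() + ".part");
        Files.write(part, data);
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException if the file system cannot do it.
        return Files.move(part, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

With `ATOMIC_MOVE`, what happens when the target already exists depends on the platform, so this sketch is best used for names that are written once.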

You can try using inotify
http://en.wikipedia.org/wiki/Inotify
If you know that the file will be opened once, written and then closed, it would be possible for your app to wait for the IN_CLOSE_WRITE event.
However, if the other application's writing behaviour is more like open, write, close, open, write, close, and so on, then you'll need some other mechanism for determining when it has truly finished with the file.

How to determine when files are done copying for further processing?

Alright, so to start: this is strictly for Windows, and I'd prefer to use C++ over .NET. I'm not opposed to boost::filesystem, although if it can be avoided in favor of the straight Windows API, I'd prefer that.
Now, the scenario is that an application on another machine, which I can't change, is going to create files in a particular directory on the machine; I need to make backups of those files and do some extra processing. Currently I've made a little application which sits and listens for change notifications in a target directory using the FindFirstChangeNotification and FindNextChangeNotification Windows APIs.
The problem is that while I can get notified when new files are created in the directory, modified, resized, etc., it only notifies once and does not tell me specifically which files changed. I've looked at ReadDirectoryChangesW as well, but it's the same story there, except that I can get slightly more specific information.
Now, I can scan the directory and try to acquire locks or open the files to determine what specifically changed since the last notification and whether the files are available for further use. But when a large file is being copied, I've found this isn't good enough: the file won't be ready to be manipulated, and I won't get any other notifications after the first. So there is no way to tell when the copy is actually done unless, after the first notification, I continually try to acquire a lock until it succeeds.
The only other thing I can think of that would be less hackish is some kind of end-token file, but since I don't have control over the application creating the files in the first place, I don't see how I'd go about doing that, and it's still not ideal.
Any suggestions?

This is a fairly common problem and one that doesn't have an easy answer. Acquiring locks is one of the best options when you cannot change the thing at the remote end. Another I have seen is to watch the file at intervals until the size doesn't change for an interval or two.
Other strategies include writing a no-byte file as a trigger when the main file is complete and writing to a temp directory then moving the complete file to the real destination. But to be reliable, it must be the sender who controls this. As the receiver, you are constrained to watching the directory and waiting for the file to settle.
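The "wait for the file to settle" approach can be sketched like this. The example is in Java rather than the asker's preferred C++, purely to illustrate the polling logic; the class name, method name, and parameters are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

/** Hypothetical receiver-side helper: polls a file until its size stops changing. */
final class SettleWatcher {
    /**
     * Returns true once the size was unchanged for stableChecks consecutive polls,
     * or false if that did not happen within maxChecks polls.
     */
    static boolean waitUntilStable(Path file, Duration interval, int stableChecks, int maxChecks)
            throws IOException, InterruptedException {
        long lastSize = -1;
        int stable = 0;
        for (int i = 0; i < maxChecks; i++) {
            long size = Files.size(file);
            if (size == lastSize) {
                stable++;
                if (stable >= stableChecks) {
                    return true;
                }
            } else {
                // Size changed (or this is the first poll): start counting again.
                stable = 0;
                lastSize = size;
            }
            Thread.sleep(interval.toMillis());
        }
        return false;
    }
}
```

This is a heuristic: a sender that pauses for longer than the polling window will look finished when it is not, so the interval has to be chosen with the slowest expected transfer in mind.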

It looks like ReadDirectoryChangesW is going to be your best bet. For each file copy operation, you should receive FILE_ACTION_ADDED followed by a series of FILE_ACTION_MODIFIED notifications. By the last FILE_ACTION_MODIFIED notification, the file should no longer be locked by the copying process. So, if you try to acquire a lock after each FILE_ACTION_MODIFIED of the copy, it should fail until the copy completes. It's not a particularly elegant solution, but there don't seem to be any notifications available for when a file copy completes.

You can process the data once the file is closed, right? So the task is to track when the file is closed. This can be done using file system filter driver. You can write your own or you can use our CallbackFilter product.

how to make sure not to read a file before finishing the write to it

When trying to monitor a directory using inotify on Linux, as we know, we get notified as soon as the file gets created (before the other process finishes writing to it).
Is there an effective way to make sure that the file is not read before writing to it is complete by the other process?
We could potentially add a delayed read; but as we all know, it is flawed.
For a little bit more clarity on the scenario; the two processes are running as different users; the load expected is about a few hundred files created per second.

Based on your question, it sounds like you're currently monitoring the directory with the IN_CREATE (and maybe IN_OPEN) flag. Why not also use the IN_CLOSE flag so that you get notified when the file is closed? From there, it should be easy to keep track of whether something has the file open and you'll know that you don't want to try reading it yet.

Create it somewhere else, write to it, close it, then rename it - or am I missing something obvious?

You can check /proc/<pid>/fd to see if the file is still open. If it is not listed there, the process does not have it open at that moment.

Maybe the lsof command can help. It lists all open files.
$ man lsof

Intercept windows open file

I'm trying to make a small program that could intercept the open process of a file.
The purpose is that when a user double-clicks a file in a given folder, Windows would inform the software, which would then process that request and return the file's data to Windows.
Maybe there is another solution, like monitoring Open messages and forcing Windows to wait while the program prepares the contents of the file.
One application of this concept could be to manage decryption of a file in a way that is transparent to the user.
In this context, the encrypted file would be on the disk, and when the user opens it (by double-clicking it or with some application such as Notepad), the background process would intercept that open event, decrypt the file, and give its contents to the requesting application.
It's a slightly strange concept; it could be like the "Man In The Middle" network concept, but with files instead of network packets.
Thanks for reading.

The best way to cover all cases of opening from any program would be via a file system filter driver. This may be too complex for your needs though.

You can use the trick that Process Explorer uses to replace itself with task manager. Basically create a key like this:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\taskmgr.exe
Where you replace 'taskmgr.exe' with the name of the process to intercept. Then add a string value called 'Debugger' that has the path to your executable. E.g:
Debugger -> "C:\windows\system32\notepad.exe"
Every time a process is run that matches the image name, your process will actually be called as a debugger for that process, with the path to the actual process as an argument.

You could use code injection and API redirection. You'd start your target process and then inject a DLL which hooks the windows API functions that you want to intercept. You then get called when the target process thinks it's calling OpenFile() or whatever and you can do what you like before passing the call on to the real API.
Google for "IAT hooking".

Windows has an option to encrypt files on the disk (file->properties->advanced->encrypt) and this option is completely transparent to the applications.
Maybe, to encrypt and decrypt portions of a disk, you should consider software like Cryptainer?
There is this software as well http://www.truecrypt.org/downloads (free and open source) but I haven't tried it.
Developing a custom solution sounds very difficult.
