C/C++: Determine whether files have been completely written

I have a directory (DIR_A) being dumped from Server A to Server B, which is
expected to take a few weeks. DIR_A has a normal tree
structure, i.e. a directory can contain subfolders, files, etc.
Aim:
As DIR_A is being dumped to server B, I have to
go through DIR_A and search for certain files within it (I do not know the
exact name of each file because server A changes the names of all the files
being sent). I cannot wait for weeks to process some files within DIR_A, so I want to
start manipulating some of the files as soon as I receive them at server B.
Brief:
Server A sends DIR_A to Server B. Expected to take weeks.
I have to start processing the files at B before the upload is
complete.
Attempt Idea:
I decided to write a program that lists the contents of DIR_A
and checks whether files exist within the folders and subfolders of DIR_A.
I thought that I might look for the EOF of a file within DIR_A: if it is not present,
then the file has not yet been completely uploaded and I should wait until the EOF
is found. So I keep looping, calculating the size of the file and verifying whether EOF is present; once it is, I start processing that file.
To simulate the above, I wrote and executed a program writing to
a text file and stopped it in the middle, without waiting for completion.
I then tried to use the program below to determine whether the EOF could be found. I assumed that since I abruptly ended the program writing to the text file, the EOF would not be present and hence the output "EOF FOUND" should not be reached. I was wrong, since it was reached. I also tried with feof() and fseek().
#include <fstream>
#include <iostream>
#include <string>

std::ifstream file(name_of_file.c_str(), std::ios::binary);
// go to the end of the file to try to determine eof
char character;
file.seekg(0, std::ios::end);
while (!file.eof()) {
    file.read(&character, sizeof(char));
}
file.close();
std::cout << "EOF FOUND" << std::endl;
Could anyone provide an idea for determining whether a file has been completely written or not?

EOF is simply C++'s way of telling you there is no more data. There's no EOF "character" that you can use to check if the file is completely written.
The way this is typically accomplished is to transfer the file over with a temporary name, e.g. myfile.txt.transferring, and once the transfer is complete, rename the file on the target host (back to something like myfile.txt). You could do the same thing by using separate directories.
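A minimal sketch of that idea from the receiving side, under the assumption that the sender uses a ".transferring" suffix and that the incoming directory path is known (both are assumptions here): skip any file that still carries the temporary suffix and only touch files that already have their final name.

#include <filesystem>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

// Sketch: the sender writes "foo.txt.transferring" and renames it to
// "foo.txt" once the transfer is done, so the reader can trust any file
// that does not carry the temporary suffix.
void process_finished_files(const fs::path& dir)
{
    const std::string tmp_suffix = ".transferring"; // assumed naming convention

    for (const auto& entry : fs::recursive_directory_iterator(dir)) {
        if (!entry.is_regular_file())
            continue;

        const std::string name = entry.path().filename().string();
        if (name.size() >= tmp_suffix.size() &&
            name.compare(name.size() - tmp_suffix.size(),
                         tmp_suffix.size(), tmp_suffix) == 0)
            continue; // still being transferred, ignore for now

        std::cout << "ready to process: " << entry.path() << '\n';
        // ... open and process the file here ...
    }
}

This of course only works if you control, or can configure, the sending side's naming scheme.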

Neither C nor C++ have a standard way to determine if the file is still open for writing by another process. We have a similar situation: a server that sends us files and we have to pick them up and handle as soon as possible. For that we use Linux's inotify subsystem, with a watch configured for IN_CLOSE_WRITE events (file was closed after having been opened for writing), which is wrapped in boost::asio::posix::stream_descriptor for convenient asynchronicity.
Depending on the OS, you may have a similar facility. Or just lsof as already suggested.
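For reference, a minimal blocking sketch of the inotify approach on Linux, without the boost::asio wrapper (the watched directory path is an assumption):

#include <sys/inotify.h>
#include <unistd.h>
#include <cstdio>

// Sketch: watch a directory and report files that were closed after being
// opened for writing, i.e. files the sender has finished with.
int main()
{
    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    // "/data/incoming" is an assumed path; IN_CLOSE_WRITE fires when a file
    // that was opened for writing is closed.
    int wd = inotify_add_watch(fd, "/data/incoming", IN_CLOSE_WRITE);
    if (wd < 0) { perror("inotify_add_watch"); return 1; }

    alignas(inotify_event) char buf[4096];
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf)); // blocks until events arrive
        if (len <= 0) break;
        for (ssize_t i = 0; i < len; ) {
            const inotify_event* ev =
                reinterpret_cast<const inotify_event*>(buf + i);
            if ((ev->mask & IN_CLOSE_WRITE) && ev->len > 0)
                std::printf("finished: %s\n", ev->name);
            i += sizeof(inotify_event) + ev->len;
        }
    }
    close(fd);
}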

All finite files have an end. If a file is being written by one process, and (assuming the OS allows it) simultaneously read (faster than it is being written) by another process, then the reading process will see an EOF once it has read all the characters that have been written so far.
What would probably work better is this: if you can determine a length of time during which you are guaranteed to receive a significant number of bytes and write them to the file (beware OS buffering), then you can walk the directory once per period, and any file that has changed its size since the last scan can be considered unfinished.
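A sketch of that size-polling idea (the directory path and the scan interval are assumptions): remember each file's size, sleep for the chosen period, and treat a file whose size has not changed between two scans as a candidate for processing.

#include <chrono>
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <map>
#include <thread>

namespace fs = std::filesystem;

// Sketch: a file whose size is unchanged between two scans is *probably*
// finished; one whose size is still growing is certainly not.
int main()
{
    const fs::path dir = "/data/incoming";        // assumed
    const auto period = std::chrono::seconds(30); // assumed scan interval

    std::map<fs::path, std::uintmax_t> last_size;
    for (;;) {
        for (const auto& entry : fs::recursive_directory_iterator(dir)) {
            if (!entry.is_regular_file())
                continue;
            const auto size = entry.file_size();
            auto it = last_size.find(entry.path());
            if (it != last_size.end() && it->second == size)
                std::cout << "stable, candidate: " << entry.path() << '\n';
            last_size[entry.path()] = size;
        }
        std::this_thread::sleep_for(period);
    }
}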
Another approach would require OS support: check what files are open by the receiving process, with a tool like lsof. Any file open by the receiver is unfinished.

In C, and I think it's the same in C++, EOF is not a character; it is a condition a file is (or is not) in. Just like media removed or network down is not a character.

Related

How to get correct file size only on the completion of a detected file change, not at the beginning?

I'm using libuv's uv_fs_event_t to monitor file changes, and once a change is detected, I open the file in the callback uv_fs_event_cb.
However, my program also requires the full file size when opening the file, so that I know how much memory to allocate based on the file size. I found that no matter whether I use libuv's uv_fs_fstat, POSIX's stat/stat64, or fseek+ftell, I never get the correct file size immediately. That is because when my program opens the file, the file is still being updated.
My program runs in a tight single thread with callbacks, so delay/sleep isn't the best option here (and offers no guaranteed correctness either).
Is there any way to handle this, with or without leveraging libuv, so that I can, say, hold off opening and reading the file until the write to the file has completed? In other words, instead of immediately detecting the start of a change to a file, can I in some way detect the completion of a file change?
One approach is to have the writer create an intermediate file, and finish I/O by renaming it to the target file. E.g. this is what happens in most browsers: the file has a "downloading.tmp" name until the download is complete, to discourage you from opening it.
Another approach is to write/touch a "finished" marker file after writing the main target file, and have the reader wait to see that file before starting its job.
A last option I can see, if the file format can be altered slightly, is to have the writer put the file size in the first bytes of the file; then the reader can preallocate correctly even if the file is not fully written, and it knows to keep reading until it has all the data.
Overall I'm suggesting that instead of a completion event, you make the writer produce any event that can be monitored after it has completed its task, and have the reader wait/synchronize on that event.
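A sketch of the first option from the writer's side (the temporary suffix and function name are assumptions): write into a temporary file and only give it its final name once everything is flushed. A rename within the same filesystem is atomic on POSIX systems, so the reader (or a uv_fs_event_t watcher) sees either no file or a complete one.

#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>

// Sketch: while data is still being produced, the file exists only under
// the temporary name; the watcher only ever reacts to the final name.
bool write_file_atomically(const char* final_name, const char* data,
                           std::size_t size)
{
    const std::string tmp_name = std::string(final_name) + ".tmp"; // assumed

    std::ofstream out(tmp_name, std::ios::binary);
    out.write(data, static_cast<std::streamsize>(size));
    out.close();
    if (!out)
        return false;

    // Atomic when source and target are on the same filesystem.
    return std::rename(tmp_name.c_str(), final_name) == 0;
}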

fortran: wait to open a file until closed by another application

I have a Fortran code which needs to read a series of ASCII data files (which all together are about 25 Gb). Basically the code opens a given ASCII file, reads the information, uses it to do some operations, and then closes it. Then it opens another file, reads the information, does some operations, and closes it again. And so on with the rest of the ASCII files.
Overall each complete run takes about 10 h. I usually need to run several independent calculations with different parameters, and the way I do it is to run each independent calculation sequentially, so that if I have 10 independent calculations, the total CPU time is 100 h.
A more rapid way would be to run the 10 independent calculations at the same time using different processors on a cluster machine, but the problem is that if a given calculation needs to open and read data from an ASCII file which has already been opened and is being used by another calculation, then the code obviously gives an error.
I wonder whether there is a way to verify if a given ASCII file is already being used by another calculation, and if so to ask the code to wait until the ASCII file is finally closed.
Any help would be of great help.
Many thanks in advance.
Obamakoak.
Two processes should be able to read the same file. Perhaps action="read" on the open statement might help. Must the files be human readable? The I/O would very likely be much faster with unformatted (sometimes called binary) files.
P.S. If your OS doesn't support multiple-read access, you might have to create your own lock system. Create a master file that a process opens to check which files are in use or not, and to update said list, closing it immediately after each check or update. To handle collisions on this read/write file, use iostat on the open statement and retry after a delay if there is an error.
I know this is an old thread but I've been struggling with the same issue for my own code.
My first attempt was creating a variable on a certain process (e.g. the master) and accessing this variable exclusively using one-sided passive MPI. This is fancy and works well, but only with newer versions of MPI.
Also, my code seemed happy to open (with READWRITE status) files that were also open in other processes.
Therefore, the easiest workaround, if your program has file access, is to make use of an external lock file, as described here. In your case, the code might look something like this:
A process checks whether the lock file exists by opening it with STATUS='NEW', which fails if the file already exists. It will look something like:
file_exists = .true.
do while (file_exists)
   open(STATUS='NEW', unit=11, file=lock_file_name, iostat=open_stat)
   if (open_stat .eq. 0) then
      file_exists = .false.
      open(STATUS='OLD', ACTION='READWRITE', unit=12, file=data_file_name, iostat=ierr)
      if (ierr .ne. 0) stop
   else
      call sleep(1)
   end if
end do
The file is now opened exclusively by the current process. Do the operations you need to do, such as reading, writing.
When you are done, close the data file and finally the lock file
close(12,iostat=ierr)
if (ierr.ne.0) stop
close(11,status='DELETE',iostat=ierr)
if (ierr.ne.0) stop
The data file is now again unlocked for the other processes.
I hope this may be useful for other people who have the same problem.

Reading file that changes over time C++

I am going to read a file in C++. The reading itself happens in a while loop, reading from one file.
When the function reads information from the file, it pushes this information to some other place in the system. The problem is that this file may change while the loop is running.
How can I catch the new information in the file? I tried std::ifstream reading while changing the file manually on my computer as the endless loop (with a sleep(2) between each iteration) was running, but as expected -- nothing happened.
EDIT: the file is overwritten at each new entry of data to the file.
Help?
Running on virtual box Ubuntu Linux 12.04, if that may be useful info. And this is not homework.
The usual solution is something along the lines of what MichaelH
proposes: the writing process opens the file in append mode, and
always writes to the end. The reading process does what
MichaelH suggests.
This works fine for small amounts of data in each run. If the
processes are supposed to run a long time, and generate a lot of
data, the file will eventually become too big, since it will
contain all of the processed data as well. In this case, the
solution is to use a directory, generating numbered files in it,
one file per data record. The writing process will write each
data set to a new file (incrementing the number), and the
reading process will try to open the new file, and delete it
when it has finished. This is considerably more complex than
the first suggestion, but will work even for processes
generating large amounts of data per second and running for
years.
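A sketch of the numbered-files variant on the reading side (the directory and naming scheme are assumptions): the reader keeps its own counter, waits for the next numbered file to appear, processes it, and deletes it afterwards. To be safe, the writer should still create each file under a temporary name and rename it, as discussed above, so the reader never opens a half-written record.

#include <chrono>
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <thread>

namespace fs = std::filesystem;

// Sketch: the writer creates rec-000001.dat, rec-000002.dat, ... and the
// reader consumes them in order, deleting each file when it is done.
int main()
{
    const fs::path dir = "/data/records"; // assumed
    unsigned long next = 1;

    for (;;) {
        char name[32];
        std::snprintf(name, sizeof(name), "rec-%06lu.dat", next);
        const fs::path file = dir / name;

        if (!fs::exists(file)) {
            std::this_thread::sleep_for(std::chrono::milliseconds(200));
            continue;
        }

        std::ifstream in(file, std::ios::binary);
        // ... read and process the record ...
        in.close();

        fs::remove(file);
        ++next;
    }
}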
EDIT:
Later comments by the OP say that the device is actually a FIFO.
In that case:
- you can't seek, so MichaelH's suggestion can't be used literally, but
- you don't need to seek, since data is automatically removed from the FIFO whenever it has been read, and
- depending on the size of the data, and how it is written, the writes may be atomic, so you don't have to worry about partial records, even if you happen to read exactly in the middle of a write.
With regards to the latter: make sure that both the read and
write buffers are large enough to contain a complete record, and
that the writer flushes after each record. And make sure that
the records are smaller than the size needed to guarantee
atomicity. (Historically, on the early Unix I know, this was
4096, but I would be surprised if it hasn't increased since
then. Although... Under Posix, this is defined by PIPE_BUF,
which is only guaranteed to be at least 512, and is only 4096
under modern Linux.)
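A sketch of the writer side under those constraints (the FIFO path and record layout are assumptions): each record is written with a single write() call no larger than PIPE_BUF, so a concurrent reader never sees half a record.

#include <fcntl.h>
#include <limits.h>
#include <unistd.h>
#include <cstddef>

// Sketch: write one record per write() call; POSIX guarantees that a
// write of at most PIPE_BUF bytes to a FIFO is atomic, so the reader
// either sees the whole record or nothing.
bool write_record(int fifo_fd, const char* record, std::size_t len)
{
    if (len > PIPE_BUF)
        return false; // too big to be written atomically

    return write(fifo_fd, record, len) == static_cast<ssize_t>(len);
}

int main()
{
    int fd = open("/tmp/myfifo", O_WRONLY); // assumed FIFO path
    if (fd < 0)
        return 1;
    const char msg[] = "one complete record\n";
    write_record(fd, msg, sizeof(msg) - 1);
    close(fd);
}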
Just read the file, rename the file, open the renamed file, do the processing of the data into your system, and at the end of the loop close the file. After a sleep, re-open the file at the top of the while loop, rename it and repeat.
That's the simplest way to approach the problem and saves having to write code to process dynamic changes to the file during the processing stage.
To be absolutely sure you don't get any corruption it's best to rename the file. This guarantees that any changes from another process do not affect the processing. It may not be necessary to do this -- it depends on the processing and how the file is updated -- but it's the safer approach. A move or rename operation is guaranteed to be atomic, so there should be no concurrency issues when using this approach.
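A sketch of that rename-then-process loop (file names and the sleep interval are assumptions): claim the file by renaming it, process the renamed copy at leisure, and let the producer recreate the original name on its next update.

#include <chrono>
#include <cstdio>
#include <fstream>
#include <string>
#include <thread>

// Sketch: because the rename is atomic, the writer's next update creates a
// fresh file rather than modifying the one currently being processed.
int main()
{
    const char* live_name = "data.txt";           // assumed, written by producer
    const char* claimed_name = "data.processing"; // assumed working name

    for (;;) {
        if (std::rename(live_name, claimed_name) == 0) {
            std::ifstream in(claimed_name);
            std::string line;
            while (std::getline(in, line)) {
                // ... push the line into the rest of the system ...
            }
            in.close();
            std::remove(claimed_name);
        }
        std::this_thread::sleep_for(std::chrono::seconds(2));
    }
}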
You can use inotify to watch for file changes.
If you need a simpler solution, read the file attributes (with stat()) and check the last_write_time of the file.
However, you may still miss some file modifications while you are opening and rereading the file. So if you have control over the application which writes to the file, I'd recommend using something else to communicate between these processes -- pipes, for example.
To be more explicit, if you want tail-like behavior you'll want to:
Open the file, read in the data. Save the length. Close the file.
Wait for a bit.
Open the file again, attempt to seek to the last read position, read the remaining data, close.
rinse and repeat
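A sketch of that tail-like loop (the file name and poll interval are assumptions):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

// Sketch: remember how far we have read, and on each pass pick up only the
// bytes appended since the previous pass.
int main()
{
    const char* path = "log.txt"; // assumed
    std::streamoff last_pos = 0;

    for (;;) {
        std::ifstream in(path, std::ios::binary);
        if (in) {
            in.seekg(0, std::ios::end);
            const std::streamoff end = in.tellg();
            if (end > last_pos) {
                in.seekg(last_pos);
                std::string chunk(static_cast<std::size_t>(end - last_pos), '\0');
                in.read(&chunk[0], end - last_pos);
                std::cout << chunk;        // new data since the last pass
                last_pos = end;
            } else if (end < last_pos) {
                last_pos = 0;              // file was truncated or replaced
            }
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}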

About EOF and Reader/Writer synchronization

How does the OS know that the writer is still writing? What is the workflow of EOF for a file (closing the file handle, like ^D or ^Z)?
What happens if EOF is never written?
What happens if the reader's reading rate is faster than the writer's writing speed? Can a rate mismatch result in deadlock?
What other unwanted scenarios can there be?
How does the OS calculate EOF while reading a file?
-Nikhil
P.S.: The current operating system is Windows, but I don't mind learning about interesting features for the same on Unix too.
More Edits and More info on the Question
Now that I know EOF is not a character, it cannot be written into the data of the file. So the OS determines EOF using the file size, as #saurabh pointed out.
(EOF while reading would probably be determined from the file size, which is stored in the file table of the appropriate file system.)
So does the process keep polling the file table for the file size to determine EOF, since there could be cases of files whose size is not fixed?
To my limited knowledge, EOF is encountered when you read beyond the end of the file (in our case, the file size). Assume a situation where the writer is writing intermittently and the reader is reading blocks. If the reader tries to read more than the available chunk, would EOF be signalled even though the writer has not finished yet?
Until the program closes the file, the OS assumes that the file can be read, written, or both (depending on the mode in which the file was opened).
EOF is nothing stored in the file; the OS knows it from the size of the file. Let's say your file is 100 bytes in size, and you ask to read 6 bytes starting from byte 99; the OS knows the file only extends to 100 bytes, so it will return only the bytes that are actually there and then report EOF.
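A minimal illustration of that arithmetic with standard C++ streams, assuming a 100-byte file and 0-based offsets (the file name is an assumption): the read stops at the end of the file, gcount() reports how much was actually delivered, and the stream's eofbit is set.

#include <fstream>
#include <iostream>

// Sketch: a 100-byte file, read from offset 99 asking for 6 bytes; only
// 1 byte can be delivered, after which the stream reports end-of-file.
int main()
{
    std::ifstream in("hundred_bytes.bin", std::ios::binary); // assumed file
    char buf[6];
    in.seekg(99);
    in.read(buf, sizeof(buf));
    std::cout << "asked for " << sizeof(buf)
              << ", got " << in.gcount()
              << ", eof=" << std::boolalpha << in.eof() << '\n';
}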
If you are writing to a file from one process and reading from it from another, you may want to look into using pipes instead. Those are special files designed exactly for your purpose: you can only write at one end, read at the other end and the reader blocks or is notified if there isn't any data to read...
And yes, there is no special EOF marker. If using normal files and you don't like headaches much, just don't mess with them simultaneously from multiple processes.

How can I know I am the only person to have a file handle open?

I have a situation where I need to pick up files from a directory and process them as quickly as they appear. The process feeding files into this directory is writing them at a pretty rapid rate (up to a thousand a minute at peak times) and I need to pull them out and process them as they arrive.
One problem I've had is knowing when my C++ code has opened a file that the sending server has finished with -- that is, one that the local FTP server isn't still writing to.
Under Solaris, how can I open a file and know with 100% certainty that no-one else has it open?
I should note that once the file has been written to and closed off it the other server won't open it again, so if I can open it and know I've got exclusive access I don't need to worry about checking that I'm still the only one with the file.
You can use flock() with the LOCK_EX operation to ensure exclusive access. fcntl() is another possible way.
#include <sys/file.h>
int flock(int fd, int operation);
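A sketch of how that call is typically used (the path is an assumption); LOCK_NB makes the call fail immediately instead of blocking if someone else still holds the lock. Note that flock() locks are advisory, so this only helps if the writing process also uses flock().

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("/ftp/incoming/file.dat", O_RDONLY); // assumed path
    if (fd < 0) { perror("open"); return 1; }

    if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
        // no one else holds the lock; safe to process under this scheme
        std::printf("got exclusive lock, processing\n");
        flock(fd, LOCK_UN);
    } else {
        std::printf("file still locked by the writer, try again later\n");
    }
    close(fd);
}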
EDIT: Two ways to do this: find an FTP server which locks the file while it is being received, or monitor the FTP server process.
I'm afraid you will not be 100% safe if you monitor the FTP server process, using pfiles or lsof (which is available here http://www.sunfreeware.com/) to make sure that no one else is accessing the files.
Maybe you can check the timestamps of the incoming files, and if they haven't changed for a few minutes it would be safe to fetch, process, or do something with them.
Is the process that feeds files into the directory owned by you? If that is the case, then rename the file's extension to .working so that you don't pick up a file that is still being used.
EDIT: Since you said it is Solaris, write a shell script and use the pfiles command to check whether the process still has the file you want open. If not, start processing the file.