Is file ready to read immediately after std::ofstream is destroyed? - c++

Basically I have following workflow (through console application):
Read binary file (std::ifstream::read)
Do something with data read
Write back to same file (std::ofstream::write), overwriting what
was there before.
Now, if I run this console program 1000 times from a shell script (always using the same file), is it safe to assume that the read will not conflict with the previous run still trying to write the file? Or do I need to wait between executions (and if so, for how long)? Can I reliably determine whether the file is ready?
I know this is not the best design; I just want to know whether it will work reliably. I am trying to quickly gather some statistics: the inputs are different, but the output file is always the same - it has to be read, the information processed, and the file updated (at this point by simply overwriting it).
EDIT:
Based on the answers, it looks like the problem of the output being wrong is not related to the OS. The read/write I do looks like this:
//read
std::ifstream input(fname, std::ios_base::binary);
while (input)
{
    unsigned value;
    input.read(reinterpret_cast<char*>(&value), sizeof(unsigned));
    ....
}
input.close();
...
//write
std::ofstream output(fname, std::ios_base::binary);
for (std::map<unsigned,unsigned>::const_iterator iter = originalMap.begin(); iter != originalMap.end(); ++iter)
{
    unsigned temp = iter->first;
    output.write(reinterpret_cast<char*>(&temp), sizeof(unsigned));
    temp = iter->second;
    output.write(reinterpret_cast<char*>(&temp), sizeof(unsigned));
}

They run sequentially; essentially, the shell script runs the same console app in a loop...
Then, on any "normal" operating system, there should be no problem.
When the application terminates, the streams are destroyed, so any data still buffered by the C/C++ streams is written to the underlying OS streams, which are then closed.
Whether the OS does some more caching is irrelevant: caching done by the operating system is transparent to applications, so as far as applications are concerned the data is now in the file. Whether it has actually reached the disk is of no concern here - applications reading the file will see the data in it anyway.
If you think about it, without such a guarantee it would be complicated to do any work reliably on a computer! :)
Update:
std::ofstream output(fname,std::ios_base::binary);
You should truncate the output file before writing to it; otherwise, if the previous contents were longer than the new output, old data will still be lingering at the end of the file:
std::ofstream output(fname,std::ios_base::binary | std::ios_base::trunc);

Check the parameters of the fstream constructor. Some implementations have an extension that lets you conveniently set sharing modes.
If you ask for exclusive read or write access, that is what you get for as long as you keep the stream open - the same operations cannot happen from a different process, or even from the same process through another stream instance.
With the pure standard library it takes more hoops, probably by setting the sharing mode in a custom filebuf and replacing the stock one. Look into that.
Using sharing modes is the mainstream way to protect file consistency, so I suggest using them in any case.
Certainly, if you make sure the race conditions are handled - one process does not open the file before the other has closed it - the result is good that way too.
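
For illustration, a minimal sketch using Microsoft's extension: their fstream constructors accept a third sharing-mode argument (the _SH_* constants from <share.h>). This is non-standard, and the file name here is just a placeholder:

#include <fstream>
#include <share.h>   // _SH_DENYRW and friends - Microsoft-specific header

int main()
{
    // The third constructor argument is a Microsoft extension: deny other
    // opens (read or write) while this stream holds the file.
    std::ofstream output("stats.bin", std::ios_base::binary, _SH_DENYRW);
    if (!output)
        return 1;    // the file is still held by another process
    unsigned value = 42;
    output.write(reinterpret_cast<const char*>(&value), sizeof value);
}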

Related

Is there any way to repair files corrupted by not properly closing boost::archive::binary_oarchive?

I ran a large processing job which produced lots of binary files as output. I think I now realize that my output data files got corrupted because I did not properly flush or close the boost::archive::binary_oarchive object before moving the data files to remote storage. I'm wondering whether there is any way to repair the output data files by appending some special EOF thing to them, or whether I'm out of luck and have to re-run the expensive job?
More specifically, my processing job dumps the binary data like so:
void dumpStuff() {
    // some code
    std::ofstream ofs(localFileName);
    boost::archive::binary_oarchive boa(ofs);
    boa << *data;
    if (uploadToRemote) {
        // code that uploads files to remote store
        // does not run when I tested locally
    }
}
What I think happens is that when I tested locally (and did not upload to remote), the boa object goes out of scope at the end of the dumpStuff function, so its destructor gets called, which properly flushes the stream and closes the file. When uploading to the remote store, however, the upload happens before boa's destructor is called, so my belief is that the stream was not properly flushed, resulting in a corrupt file. When I fetch the corrupted file from the store and attempt to load it with boost::archive::binary_iarchive, I get an InputStreamError.
I know I can fix the problem by adding some braces around the boa code to force it to go out of scope before uploading to remote (a sketch of that fix follows below); however, this will only solve my problem if I re-run the big, expensive job. So my question is: is there some simple way to just append something to the end of my corrupt files to de-corrupt them? Some kind of EOF signal?
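
For reference, a minimal sketch of that scoping fix, using the names from the question (nothing new here - it only forces the archive's destructor to run before the upload):

void dumpStuff() {
    // some code
    {
        std::ofstream ofs(localFileName);
        boost::archive::binary_oarchive boa(ofs);
        boa << *data;
    }   // boa's destructor runs here: archive trailer written, stream flushed and closed

    if (uploadToRemote) {
        // upload only after the file on disk is complete
    }
}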
There may not be. Then again most of it will surely depend on the flushing behaviour of the underlying stream you used.
This is a one-off problem that only you have, so you will have to make a solution.
One way would be to look at the source code to figure out exactly which actions were skipped because of the missing close, and then either compensate for the missing data OR make the input-archive implementation more tolerant of a corrupt/missing tail.
The other approach would be to use your own code WITH the flaw to write an archive, and then write the SAME archive but with the error fixed.
Just look at the difference in a hex editor. You may be lucky and find that the data missing from the archive is fixed (always the same). If so, just append it to any corrupt file and be glad. More likely there will be some (simple) variable data, like a checksum or a total size. In that case, either try to generate the missing data, or hack the input-stream implementation's handling of the expected checksum.
CAVEAT: All of these suggestions involve meddling with undocumented details; there will be no support, and reliability depends solely on your own accuracy.
If you choose to "fake" checksums, be aware that this defeats the built-in error detection, so you might still read unreliable data (in case something was corrupted in storage or transit).

Overwriting a file without the risk of a corrupt file

So often my applications want to save files to load again later. Having recently been unlucky with a crash, I want to write the operation in such a way that I am guaranteed to end up with either the new data or the original data, but not a corrupted mess.
My first idea was to do something along the lines of (to save a file called example.dat):
1. Come up with a unique file name for the target directory, e.g. example.dat.tmp.
2. Create that file and write my data to it.
3. Delete the original file (example.dat).
4. Rename ("move") the temp file to where the original was (example.dat.tmp -> example.dat).
Then at load time the application can follow the following rules:
If no "example.dat" and no "example.dat.tmp", first run / new project, so load in the defaults / create new file.
If "example.dat" and no "example.dat.tmp", then load example.dat (normal load case)
If "example.dat.tmp" exists offer the user the chance to potentially recover data. If "example.dat" also exists, do not overwrite it without explicit user constant.
However, having done a little research, I found that besides the OS caching (which I may be able to override with the file flush methods), some disk drives also cache internally and may even lie to the OS, saying they are done. So step 4 could complete while the data has not actually been written, and if the system goes down I have lost my data...
I am not sure the disk problem is actually solvable by an application, but are the general rules above the correct thing to do? Should I keep an old recovery copy of the file for longer, to be safe? What are the guidelines regarding such things (e.g. acceptable disk usage, whether the user should choose, where to put such files, etc.)?
Also, how should I avoid potential conflicts with the user and other programs over "example.dat.tmp"? I recall sometimes seeing "~example.dat" from other software; is that a better convention?
If the disk drives report back to the OS that the data is physically on the disk when it is not, there's not much you can do about it. A lot of disks do cache a certain number of writes and report them done, but such disks should have a battery backup and finish the physical writes no matter what (and they won't lose data in case of a system crash, since they won't even see it).
For the rest, you say you've done some research, so you no doubt know that you can't use std::ofstream (nor FILE*) for this; you have to do the actual writes at the system level, and open the files with special attributes to ensure full synchronization. Otherwise, the operations can stick around in the OS buffering for a while. And as far as I know, there's no way of ensuring such synchronization for a rename. But I'm not sure that it's necessary if you always keep two versions: my usual convention in such cases is to write to a file "example.dat.new", then when I'm done writing, delete any file named "example.dat.bak", rename "example.dat" to "example.dat.bak", and then rename "example.dat.new" to "example.dat". Given this, you should be able to figure out what did or did not happen, and find the correct file (interactively, if need be, or by inserting an initial line with the timestamp).
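
A rough sketch of that .new/.bak convention in portable C++ (as noted above, std::ofstream alone does not guarantee physical synchronization, so a fully robust version needs OS-level flush calls; error handling is kept minimal):

#include <cstdio>     // std::rename, std::remove
#include <fstream>
#include <string>

// Sketch of the "example.dat.new" / "example.dat.bak" dance described above.
bool saveSafely(const std::string& base, const char* data, std::streamsize size)
{
    const std::string newName = base + ".new";
    const std::string bakName = base + ".bak";

    {
        std::ofstream out(newName, std::ios_base::binary | std::ios_base::trunc);
        out.write(data, size);
        if (!out)
            return false;                      // write failed, original untouched
    }                                          // stream flushed and closed here

    std::remove(bakName.c_str());              // ignore "file not found"
    std::rename(base.c_str(), bakName.c_str());             // keep the previous version
    return std::rename(newName.c_str(), base.c_str()) == 0; // publish the new version
}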
You should lock the actual data file while you write its substitute, if there's a chance that a different process could be going through the same protocol that you are describing.
You can use flock for the file lock.
As for your temp file name, you could make your process ID part of it, for instance "example.dat.3124". No other simultaneously running process would generate the same name.
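
A POSIX-only sketch of those two ideas together - an advisory flock() on the live data file plus a PID-suffixed temp name (the file names are placeholders):

#include <sys/file.h>   // flock, LOCK_EX, LOCK_UN
#include <fcntl.h>      // open, O_RDWR
#include <unistd.h>     // close, getpid
#include <string>

int main()
{
    // A temp name no other simultaneously running process will pick.
    std::string tmpName = "example.dat." + std::to_string(getpid());

    int fd = open("example.dat", O_RDWR);
    if (fd < 0)
        return 1;
    flock(fd, LOCK_EX);   // advisory lock; blocks until we own it

    // ... write the substitute file under tmpName, then rename it over
    //     "example.dat" as described above ...

    flock(fd, LOCK_UN);
    close(fd);
}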

How to check if a file is still being written?

How can I check if a file is still being written? I need to wait for a file to be created, written and closed again by another process, so I can go on and open it again in my process.
In general, this is a difficult problem to solve. You can ask whether a file is open, under certain circumstances; however, if the other process is a script, it might well open and close the file multiple times. I would strongly recommend you use an advisory lock, or some other explicit method for the other process to communicate when it's done with the file.
That said, if that's not an option, there is another way. If you look in the /proc/<pid>/fd directories, where <pid> is the numeric process ID of some running process, you'll see a bunch of symlinks to the files that process has open. The permissions on the symlink reflect the mode the file was opened for - write permission means it was opened for write mode.
So, if you want to know if a file is open, just scan over every process's /proc entry, and every file descriptor in it, looking for a writable symlink to your file. If you know the PID of the other process, you can directly look at its proc entry, as well.
This has some major downsides, of course. First, you can only see open files for your own processes, unless you're root. It's also relatively slow, and only works on Linux. And again, if the other process opens and closes the file several times, you're stuck - you might end up seeing it during the closed period, and there's no easy way of knowing if it'll open it again.
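A Linux-only sketch of that scan using C++17 <filesystem> (pass an absolute path; the file name below is just an example, and only processes you are allowed to inspect will show up):

#include <filesystem>
#include <iostream>
#include <system_error>

namespace fs = std::filesystem;

// Returns true if any visible process holds the target open for writing.
bool openForWriting(const fs::path& target)
{
    std::error_code ec;
    for (const auto& proc : fs::directory_iterator("/proc", ec))
    {
        for (const auto& fd : fs::directory_iterator(proc.path() / "fd", ec))
        {
            if (!fs::is_symlink(fd.symlink_status(ec)))
                continue;
            if (fs::read_symlink(fd.path(), ec) != target)
                continue;
            // Write permission on the symlink itself reflects the open mode.
            auto perms = fs::symlink_status(fd.path(), ec).permissions();
            if ((perms & fs::perms::owner_write) != fs::perms::none)
                return true;
        }
    }
    return false;
}

int main()
{
    std::cout << openForWriting("/tmp/output.dat") << '\n';
}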
You could let the writing process write a sentinel file (say "sentinel.ok") after it has finished writing the data file your reading process is interested in. In the reading process, check for the existence of the sentinel before reading the data file, to ensure that the data file is completely written.
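A minimal sketch of the sentinel approach (file names are just examples; the key point is that the marker is created only after the data file has been closed):

#include <fstream>
#include <filesystem>
#include <thread>
#include <chrono>

void writerSide()
{
    {
        std::ofstream data("data.bin", std::ios_base::binary);
        // ... write everything ...
    }                                    // data file flushed and closed here
    std::ofstream marker("sentinel.ok"); // create the marker last
}

void readerSide()
{
    // Wait until the writer signals completion, then read the data file.
    while (!std::filesystem::exists("sentinel.ok"))
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::ifstream data("data.bin", std::ios_base::binary);
    // ... safe to read the complete file ...
}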
#blu3bird's idea of using a sentinel file isn't bad, but it requires modifying the program that's writing the file.
Here's another possibility that also requires modifying the writer, but it may be more robust:
Write to a temporary file, say "foo.dat.part". When writing is complete, rename "foo.dat.part" to "foo.dat". That way a reader either won't see "foo.dat" at all, or will see a complete version of it.
You can try using inotify
http://en.wikipedia.org/wiki/Inotify
If you know that the file will be opened once, written and then closed, it would be possible for your app to wait for the IN_CLOSE_WRITE event.
However, if the other application's writing behaviour is more like open, write, close, open, write, close, ... then you'll need some other mechanism to determine when it has truly finished with the file.
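A Linux-only sketch of waiting for that event (assumes the single open/write/close pattern; the path is a placeholder):

#include <sys/inotify.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int notifyFd = inotify_init();
    if (notifyFd < 0)
        return 1;

    // Watch the file for "closed after being opened for writing".
    int wd = inotify_add_watch(notifyFd, "/tmp/output.dat", IN_CLOSE_WRITE);
    if (wd < 0)
        return 1;

    char buf[4096];
    ssize_t n = read(notifyFd, buf, sizeof buf);   // blocks until an event arrives
    if (n > 0)
        std::puts("writer has closed the file - safe to open it now");

    inotify_rm_watch(notifyFd, wd);
    close(notifyFd);
}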

fopen: is it good idea to leave open, or use buffer?

So I have many log files that I need to write to. They are created when the program begins and saved to file when the program closes.
I was wondering if it is better to do:
fopen() at the start of the program, then close the files when the program ends - I would just write to the files when needed. Will anything (such as other file I/O) be slowed down by these files still being "open"?
OR
I save what needs to be written into a buffer, then open the file, write from the buffer, and close the file when the program ends. I imagine this would be faster?
Well, fopen(3) + fwrite(3) + fclose(3) is a buffered I/O package, so another layer of buffering on top of it might just slow things down.
In any case, go for a simple and correct program. If it seems to run slowly, profile it, and then optimize based on evidence and not guesses.
Short answer:
A large number of open files shouldn't slow anything down.
Writing to a file will be buffered anyway.
So you can leave those files open, but do not forget to check your OS's limit on open files.
Part of the point of log files is being able to figure out what happened when/if your program runs into a problem. Quite a few people also do log file analysis in (near) real-time. Your second scenario doesn't work for either of these.
I'd start with the first approach, but with a high-enough level interface that you could switch to the second if you really needed to. I wouldn't view that switch as a major benefit of the high-level interface though -- the real benefit would normally be keeping the rest of the code a bit cleaner.
There is no good reason to buffer log messages in your program and write them out on exit. Simply write them as they're generated using fprintf. The stdio system will take care of the buffering for you. Of course this means opening the file (with fopen) from the beginning and keeping it open.
For log files, you will probably want a functional interface that flushes the data to disk after each complete message, so that if the program crashes (it has been known to happen), the log information is safe. Leaving stuff in standard I/O buffers means excavating the data from a core dump - which is less satisfactory than having the information on disk safely.
Other I/O really won't be affected by holding one - or even a few - log files open. You lose a few file descriptors, perhaps, but that is not often a serious problem. When it is a problem, you use one file descriptor for one log file - and you keep it open so you can log information. You might elect to map stderr to the log file, leaving that as the file descriptor that's in use.
It's been mentioned that the FILE* returned by fopen is already buffered. For logging, you should probably also look into using the setbuf() or setvbuf() functions to change the buffering behavior of the FILE*.
In particular, you might want to set the buffering mode to line-at-a-time, so the log file is flushed automatically after each line is written. You can also specify the size of the buffer to use.
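A small sketch of that, assuming a single log file opened at startup. Note that some implementations (MSVC, for instance) treat _IOLBF as full buffering, so an explicit fflush() after each message is the more portable way to get the crash-safety mentioned earlier:

#include <cstdio>

int main()
{
    std::FILE* log = std::fopen("app.log", "a");
    if (!log)
        return 1;

    // _IOLBF = line buffered: each complete line is handed to the OS as
    // soon as it is written. nullptr/BUFSIZ lets stdio allocate the buffer.
    std::setvbuf(log, nullptr, _IOLBF, BUFSIZ);

    std::fprintf(log, "started\n");
    // ... log as events happen; add std::fflush(log) after critical messages ...

    std::fclose(log);
}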

Ensuring a file is flushed when file created in external process (Win32)

Windows Win32 C++ question about flushing file activity to disk.
I have an external application (ran using CreateProcess) which does some file creation. i.e., when it returns it will have created a file with some content.
How can I ensure that the file the process created was really flushed to disk, before I proceed?
By this I mean not the C++ buffers but really flushing to disk (e.g. FlushFileBuffers).
Remember that I don't have access to any file HANDLE - this is all of course hidden inside the external process.
I guess I could open up a handle of my own to the file and then use FlushFileBuffers, but it's not clear this would work (since my handle doesn't actually contain anything which needs flushing).
Finally, I want this to run in non-admin userspace so I cannot use FlushFileBuffers on a whole volume.
Any ideas?
UPDATE: Why do I think this is a problem?
I'm working on a data backup application. Essentially it has to create some files as described. It then has to update its internal DB (an embedded SQLite DB).
I recently had a data corruption issue which occurred during a bluescreen (the cause of which was unrelated to my app).
What I'm concerned about is application integrity during a system crash. And yes, I do care about this because this app is a data backup app.
The use case I'm concerned about is this:
A small data file is created using external process. This write is waiting in the OS cache to be written to disk.
I update the DB and commit. This is a disk activity. This write is also waiting in the OS cache.
A system failure occurs.
As I see it, we're now in a potential race condition. If "1" gets flushed and "2" doesn't then we're fine (as the DB transact wasn't then committed). If neither gets flushed or both get flushed then we're also OK.
As I understand it, the writes will be non-deterministic. i.e., I'm not aware that the OS will guarantee to write "1" before "2". (Am I wrong?)
So, if "2" gets flushed, but "1" doesn't then we have a problem.
What I observed was that the DB was correctly updated, but that the file contained garbage: the last two-thirds of the data were binary zeroes. Now, I don't know what a file looks like when it was only partly flushed at the time of a bluescreen, but I wouldn't be surprised if it looked like that.
Can I guarantee this is the cause? No I cannot guarantee this. I'm just speculating. It could just be that the file was "naturally" corrupted due to disk failure or as a result of the blue screen.
With regards to performance, this is something I believe I can deal with.
For example, the default behaviour of SQLite is to do a full file flush (using FlushFileBuffers) every time you commit a transaction. They are quite clear that if you don't do this then at the time of system crash, you might have a corrupted DB.
Also, I believe I can mitigate the performance hit by only flushing at "checkpoints". For example, writing 50 files, flushing the lot and then writing to the DB.
How likely is all this to be a problem? Beats me. But my app might well be archiving at or around the time of a system failure, so it might be more likely than you think.
Hope that explains why I want to do this.
Why would you want this? The OS will make sure that the data is flushed to disk in due time. If you access the file, the OS will return the data either from the cache or from the disk, so this is transparent to you.
If you need some safety in case of disaster, then you must call FlushFileBuffers, for example by creating a process with admin rights after running the external process. But that can severely impact the performance of the whole machine.
Your only other option is to modify the source of the other process.
[EDIT] The simplest solution is probably to copy the file in your own process and then flush the copy (since you have that handle). Save the copy under a name which says "not committed in the database".
Then update the database. Write into the database, "updated from file ...". If this entry already exists next time, don't update the database and skip this step.
Flush the database to disk.
Rename the file to "file has been processed into database". Rename is an atomic operation (so it either happens or not).
If you can't think of a good filename for the different states, then use subfolders and move the file between them.
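A Win32 sketch of that copy-then-flush idea (file names are placeholders, error handling is minimal):

#include <windows.h>

bool copyAndFlush(const char* source, const char* notCommittedCopy)
{
    if (!CopyFileA(source, notCommittedCopy, FALSE))   // FALSE = overwrite if it exists
        return false;

    HANDLE h = CreateFileA(notCommittedCopy, GENERIC_WRITE, 0, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return false;

    // We own this handle, so FlushFileBuffers is allowed: it forces the
    // copy's data and metadata out of the OS cache onto the disk.
    BOOL flushed = FlushFileBuffers(h);
    CloseHandle(h);
    return flushed != 0;
}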
Well, there are no attractive options here. There is no documented way to retrieve the file handle you need from the other process. There are undocumented ones, but go there (via DuplicateHandle) only after careful consideration.
Yes, calling FlushFileBuffers on a volume handle is the documented way. You can avoid the privilege problem by letting a service make the call. Talk to it from your app with one of the standard process interop mechanisms. A named pipe whose name is prefixed with Global\ is probably the easiest way to get that going.
After your update I think http://sqlite.org/atomiccommit.html gives you the answers you need.
The way SQLite ensures that everything is flushed to disk works, so it can work for you as well - take a look at the source.