fopen: is it a good idea to leave files open, or use a buffer? - c++

So I have many log files that I need to write to. They are created when the program begins, and their contents are written out when the program closes.
I was wondering if it is better to do:
fopen() at the start of the program, then close the files when the program ends - I would just write to the files when needed. Will anything (such as other file I/O) be slowed down by these files still being "open"?
OR
I save what needs to be written into a buffer, and then open the file, write from the buffer, and close the file when the program ends. I imagine this would be faster?

Well, fopen(3) + fwrite(3) + fclose(3) is a buffered I/O package, so another layer of buffering on top of it might just slow things down.
In any case, go for a simple and correct program. If it seems to run slowly, profile it, and then optimize based on evidence and not guesses.

Short answer:
A large number of open files shouldn't slow anything down.
Writing to the files will be buffered anyway.
So you can leave those files open, but do not forget to check the limit on open files in your OS.

Part of the point of log files is being able to figure out what happened when/if your program runs into a problem. Quite a few people also do log file analysis in (near) real-time. Your second scenario doesn't work for either of these.
I'd start with the first approach, but with a high-enough level interface that you could switch to the second if you really needed to. I wouldn't view that switch as a major benefit of the high-level interface though -- the real benefit would normally be keeping the rest of the code a bit cleaner.

There is no good reason to buffer log messages in your program and write them out on exit. Simply write them as they're generated using fprintf. The stdio system will take care of the buffering for you. Of course this means opening the file (with fopen) from the beginning and keeping it open.
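A minimal sketch of that approach (the file name, message format, and helper function here are placeholders for illustration, not anything from the question):
#include <cstdio>

std::FILE* g_log = nullptr;   // opened once at startup; stdio buffers the writes

void log_message(const char* text)
{
    if (g_log)
        std::fprintf(g_log, "%s\n", text);
}

int main()
{
    g_log = std::fopen("app.log", "a");
    log_message("program started");
    // ... normal work, logging as events happen ...
    std::fclose(g_log);       // flushes whatever stdio still has buffered
}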

For log files, you will probably want a functional interface that flushes the data to disk after each complete message, so that if the program crashes (it has been known to happen), the log information is safe. Leaving stuff in standard I/O buffers means excavating the data from a core dump - which is less satisfactory than having the information on disk safely.
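A sketch of that flush-per-message idea, assuming a FILE* log handle as in the previous answer; note that fflush only hands the data to the OS, so a crash of the OS itself could still lose it:
#include <cstdio>

void log_line(std::FILE* log, const char* msg)
{
    std::fprintf(log, "%s\n", msg);
    std::fflush(log);   // push the stdio buffer to the OS after each complete message
    // For the data to reach the disk itself, POSIX systems would additionally
    // need fsync(fileno(log)); that is a much more expensive call.
}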
Other I/O really won't be affected by holding one - or even a few - log files open. You lose a few file descriptors, perhaps, but that is not often a serious problem. When it is a problem, you use one file descriptor for one log file - and you keep it open so you can log information. You might elect to map stderr to the log file, leaving that as the file descriptor that's in use.

It's been mentioned that the FILE* returned by fopen is already buffered. For logging, you should probably also look into using the setbuf() or setvbuf() functions to change the buffering behavior of the FILE*.
In particular, you might want to set the buffering mode to line-at-a-time, so the log file is flushed automatically after each line is written. You can also specify the size of the buffer to use.
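A sketch of the line-buffered setup (the buffer size is arbitrary; note that some implementations, Windows in particular, treat line buffering as full buffering):
#include <cstdio>

std::FILE* log = std::fopen("app.log", "a");
if (log)
    std::setvbuf(log, nullptr, _IOLBF, 4096);   // line-buffered, 4 KiB buffer;
                                                // must be called before the first write
// From here on, each '\n'-terminated log line is handed to the OS automatically.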

Related

Is it more efficient to rewind a file than closing it and opening it up again?

I'm writing a little C++ program for myself. At the beginning of it, I read a file all the way to the bottom, and later on, right before the program ends, I need to read that file again from the beginning.
My question is, is it more efficient to have the file open during the execution (even though I won't be using it) and just rewind it when I need it again, or should I close it the first time and then open it again when I need it?
Edit: Just to clarify, my question is not only related to the specific project that I'm working on. It is really small (less than 300 lines of code), so there won't be any noticeable performance difference. I'm asking about opening, closing and "rewinding" files in general, so it's applicable to other, bigger projects where performance and memory may actually matter.
If you close and open the file, the OS definitely needs to update its lock for the file and your process's list of resources (open files). Furthermore, close and open are two system calls (kernel calls), and system calls are not cheap. Every system call requires translating virtual addresses.
Closing the file can (if there is any change) force the cache to be written to the hard disk, which means a seek time of about 15 ms (a physical move of the platter). It can be even worse in the case of a network drive.
After closing the file, some properties need to be updated, and a filesystem watcher may be triggered.
An antivirus scan may be triggered after closing the file, depending on the file name, path, and antivirus brand.
Furthermore, closing the file carries the risk that you will not be able to open it again because another process has grabbed it. For example, Dropbox reads every file in the Dropbox folder after a change, so closing and reopening a file there does not always work (Dropbox may get to it first). And who knows how users will use your application - users are inventive, and they share files in ways you didn't think of.
You might be able to measure an efficiency gain in the range of a few nanoseconds if you fseek to the beginning of the file, but I don't think this is worth it when you are only dealing with a single file.
Like others said: try to find other areas of code which you can optimize.
As with all performance issues, the final optimizations vary widely. Measure both implementations against a reasonable data set and take it from there.
As a design choice it may be simpler to cache the contents of the file in memory once it has been read the first time and then there is no need to re-read the contents. If the modified content is required then again, cache the modified data to forgo the second read.
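For reference, rewinding an already-open std::ifstream looks like this (the file name is a placeholder; clearing the stream state first is the step people usually forget):
#include <fstream>

std::ifstream in("data.txt");
// ... first pass reads to end-of-file, which sets eofbit/failbit ...

in.clear();                  // reset the error flags so further reads work
in.seekg(0, std::ios::beg);  // rewind to the start for the second pass
// ... read the file again ...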

Overwriting a file without the risk of a corrupt file

So often my applications want to save files to load again later. Having recently got unlucky with a crash, I want to write the operation in such a way that I am guaranteed to either have the new data or the original data, but not a corrupted mess.
My first idea was to do something along the lines of (to save a file called example.dat):
Come up with a unique file name for the target directory, e.g. example.dat.tmp
Create that file and write my data to it.
Delete the original file (example.dat)
Rename ("Move") the temp file to where the original was (example.dat.tmp -> example.dat).
Then at load time the application can follow the following rules:
If no "example.dat" and no "example.dat.tmp", first run / new project, so load in the defaults / create new file.
If "example.dat" and no "example.dat.tmp", then load example.dat (normal load case)
If "example.dat.tmp" exists offer the user the chance to potentially recover data. If "example.dat" also exists, do not overwrite it without explicit user constant.
However, having done a little research, I found that as well as OS caching which I may be able to override with the file flush methods, some disk drives still then cache internally and may even lie to the OS saying they are done, so 4. could complete, the write is not actually written, and if the system goes down I have lost my data...
I am not sure the disk problem is actually solvable by an application, but are the general rules above the correct thing to do? Should I keep an old recovery copy of the file for longer to be sure, what are the guidelines regarding such things (e.g. acceptable disk usage, should the user choose, where to put such files, etc.).
Also how should I avoid potential conflict the user and other programs for "example.dat.tmp". I recall seeing a "~example.dat" sometimes from some other software, is that a better convention?
If the disk drives report back to the OS that the data is physically on the disk, and it's not, then there's not much you can do about it. A lot of disks do cache a certain number of writes and report them done, but such disks should have a battery backup and finish the physical writes no matter what (and they won't lose data in the case of a system crash, since they won't even see it).
For the rest, you say you've done some research, so you no doubt know that you can't use std::ofstream (nor FILE*) for this; you have to do the actual writes at the system level, and open the files with special attributes to ensure full synchronization. Otherwise, the operations can stick around in the OS buffering for a while. And as far as I know, there's no way of ensuring such synchronization for a rename.
But I'm not sure that it's necessary if you always keep two versions: my usual convention in such cases is to write to a file "example.dat.new"; then, when I'm done writing, delete any file named "example.dat.bak", rename "example.dat" to "example.dat.bak", and then rename "example.dat.new" to "example.dat". Given this, you should be able to figure out what did or did not happen and find the correct file (interactively, if need be, or by inserting an initial line with a timestamp).
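A sketch of that .new/.bak convention using C++17 std::filesystem; it only covers the file shuffling, not the low-level write synchronization discussed above:
#include <filesystem>
#include <fstream>
#include <string>
namespace fs = std::filesystem;

void save(const std::string& data)
{
    std::ofstream out("example.dat.new", std::ios::binary | std::ios::trunc);
    out << data;
    out.close();                              // hand everything to the OS

    fs::remove("example.dat.bak");            // no-op if it doesn't exist
    if (fs::exists("example.dat"))
        fs::rename("example.dat", "example.dat.bak");
    fs::rename("example.dat.new", "example.dat");
}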
You should lock the actual data file while you write its substitute, if there's a chance that a different process could be going through the same protocol that you are describing.
You can use flock for the file lock.
As for your temp file name, you could make your process ID part of it, for instance "example.dat.3124". No other simultaneously running process would generate the same name.
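A POSIX-only sketch combining both suggestions (flock plus a PID-based temp name); flock is advisory, so it only helps against other processes that also use it:
#include <sys/file.h>   // flock
#include <fcntl.h>      // open
#include <unistd.h>     // close, getpid
#include <cstdio>

int fd = open("example.dat", O_RDWR | O_CREAT, 0644);
if (fd != -1 && flock(fd, LOCK_EX) == 0)        // blocks until we hold the lock
{
    char tmp_name[64];
    std::snprintf(tmp_name, sizeof tmp_name, "example.dat.%ld", (long)getpid());
    // ... write the new contents to tmp_name, then rename it over example.dat ...
    flock(fd, LOCK_UN);
    close(fd);
}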

Is file ready to read immediately after std::ofstream is destroyed?

Basically I have the following workflow (through a console application):
Read binary file (std::ifstream::read)
Do something with data read
Write back to the same file (std::ofstream::write), overwriting what was there before.
Now, if I run this whole console program 1000 times through a shell script (always using the same file), is it safe to assume that the read operation will not conflict with a previously run instance still trying to write to the file? Or do I need to wait between executions (how long???)? Can I reliably determine if the file is ready?
I know it is not the best design; I just want to know if it will work reliably (I'm trying to quickly gather some statistics - the inputs are different, but the output file is always the same: it needs to be read, the info needs to be processed, and then the file needs to be updated (at this point, simply overwritten)).
EDIT:
It looks like the problem with the output being wrong is not related to the OS, based on the answers. The read/write I do looks like:
//read
std::ifstream input(fname, std::ios_base::binary);
while (input)
{
    unsigned value;
    input.read(reinterpret_cast<char*>(&value), sizeof(unsigned));
    ....
}
input.close();
...
//write
std::ofstream output(fname, std::ios_base::binary);
for (std::map<unsigned,unsigned>::const_iterator iter = originalMap.begin(); iter != originalMap.end(); ++iter)
{
    unsigned temp = iter->first;
    output.write(reinterpret_cast<char*>(&temp), sizeof(unsigned));
    temp = iter->second;
    output.write(reinterpret_cast<char*>(&temp), sizeof(unsigned));
}
They run sequentially - essentially, the shell script runs the same console app in a loop...
Then, on any "normal" operating system, there should be no problem.
When the application terminates, the streams are destroyed, so all the data that may be kept in cache by the C/C++ streams is written to the underlying OS streams, which are then closed.
Whether the OS does some more caching is irrelevant - caching done by the operating system is transparent to applications. So, as far as applications are concerned, the data has now been written to the file. Whether it has actually been written to disk is of no concern here - applications reading from the file will see the data in it anyway.
If you think about it, without such a guarantee it would be complicated to do reliably any work on a computer! :)
Update:
std::ofstream output(fname,std::ios_base::binary);
You should truncate the output file before writing to it; otherwise, if the input file was longer than the new output, old data will still be lingering at the end of the file:
std::ofstream output(fname,std::ios_base::binary | std::ios_base::trunc);
Check the parameters of the fstream constructor. Some implementations have an extension that lets you conveniently set sharing modes.
If you ask for exclusive read or write, that's what you get as long as you keep the stream open - other similar operations cannot happen, whether from a different process or from another stream instance in the same one.
With the pure standard it takes more hoops, probably setting the mode in a custom filebuf and replacing the stock one. Look into that.
Using sharing modes is the mainstream way to defend file consistency, so I suggest using them in any case.
Certainly, if you make sure race conditions are handled so that one process does not open the file before the other has closed it, the result is good that way too.
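As an example of such an extension, Microsoft's fstream constructors accept a third sharing argument. This is MSVC-specific and not standard C++, so treat the sketch below as an assumption to verify against your library's documentation:
#include <fstream>
#include <share.h>   // _SH_DENYRW and related constants (MSVC only)

// While this stream is open, other attempts to open "data.bin" - from this
// process or another - are denied both read and write access.
std::ofstream output("data.bin", std::ios_base::binary, _SH_DENYRW);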

How to check if a file is still being written?

How can I check if a file is still being written? I need to wait for a file to be created, written and closed again by another process, so I can go on and open it again in my process.
In general, this is a difficult problem to solve. You can ask whether a file is open, under certain circumstances; however, if the other process is a script, it might well open and close the file multiple times. I would strongly recommend you use an advisory lock, or some other explicit method for the other process to communicate when it's done with the file.
That said, if that's not an option, there is another way. If you look in the /proc/<pid>/fd directories, where <pid> is the numeric process ID of some running process, you'll see a bunch of symlinks to the files that process has open. The permissions on the symlink reflect the mode the file was opened for - write permission means it was opened for write mode.
So, if you want to know if a file is open, just scan over every process's /proc entry, and every file descriptor in it, looking for a writable symlink to your file. If you know the PID of the other process, you can directly look at its proc entry, as well.
This has some major downsides, of course. First, you can only see open files for your own processes, unless you're root. It's also relatively slow, and only works on Linux. And again, if the other process opens and closes the file several times, you're stuck - you might end up seeing it during the closed period, and there's no easy way of knowing if it'll open it again.
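A rough Linux-only sketch of that /proc scan using C++17 std::filesystem; it ignores permission errors and does not check the open mode, only whether the file is open at all:
#include <filesystem>
#include <system_error>
namespace fs = std::filesystem;

// `target` should be an absolute path, since /proc fd symlinks hold absolute paths.
bool file_is_open_somewhere(const fs::path& target)
{
    std::error_code ec;
    for (const auto& proc : fs::directory_iterator("/proc", ec))
    {
        // Non-process entries or inaccessible processes just yield an empty iterator.
        for (const auto& fd : fs::directory_iterator(proc.path() / "fd", ec))
        {
            if (fs::read_symlink(fd.path(), ec) == target)
                return true;
        }
    }
    return false;
}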
You could let the writing process write a sentinel file (say "sentinel.ok") after it is finished writing the data file your reading process is interested in. In the reading process you can check for the existence of the sentinel before reading the data file, to ensure that the data file is completely written.
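In the reading process, that check can be a simple polling loop; a sketch assuming the sentinel name suggested above:
#include <chrono>
#include <filesystem>
#include <thread>

// Wait until the writer has dropped the sentinel next to the data file.
while (!std::filesystem::exists("sentinel.ok"))
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
// ... safe to read the data file now ...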
@blu3bird's idea of using a sentinel file isn't bad, but it requires modifying the program that's writing the file.
Here's another possibility that also requires modifying the writer, but it may be more robust:
Write to a temporary file, say "foo.dat.part". When writing is complete, rename "foo.dat.part" to "foo.dat". That way a reader either won't see "foo.dat" at all, or will see a complete version of it.
You can try using inotify
http://en.wikipedia.org/wiki/Inotify
If you know that the file will be opened once, written and then closed, it would be possible for your app to wait for the IN_CLOSE_WRITE event.
However if the behaviour of the other application doing the writing of the file is more like open, write, close, open, write, close, ... then you'll need some other mechanism of determining when the other app has truly finished with the file.
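A Linux-only sketch of waiting for that IN_CLOSE_WRITE event with the inotify C API (watching the containing directory; error handling omitted):
#include <sys/inotify.h>
#include <unistd.h>
#include <string>

// Block until a file named `wanted` inside `dir` is closed after being written.
void wait_for_close_write(const char* dir, const std::string& wanted)
{
    int fd = inotify_init();
    int wd = inotify_add_watch(fd, dir, IN_CLOSE_WRITE);
    alignas(inotify_event) char buf[4096];
    for (;;)
    {
        ssize_t len = read(fd, buf, sizeof buf);
        for (char* p = buf; p < buf + len; )
        {
            const inotify_event* ev = reinterpret_cast<const inotify_event*>(p);
            if ((ev->mask & IN_CLOSE_WRITE) && ev->len > 0 && wanted == ev->name)
            {
                inotify_rm_watch(fd, wd);
                close(fd);
                return;
            }
            p += sizeof(inotify_event) + ev->len;
        }
    }
}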

File corruption detection and error handling

I'm a newbie C++ developer and I'm working on an application which needs to write out a log file every so often, and we've noticed that the log file has been corrupted a few times when running the app. The main scenarios seem to be when the program is shutting down or crashes, but I'm concerned that this isn't the only time that something may go wrong, as the application was born out of a fairly "quick and dirty" project.
It's not critical to have the most up-to-date data saved, so one idea someone mentioned was to alternate writes between two log files; then, if the program crashes, at least one will still have proper integrity. But this doesn't smell right to me, as I haven't really seen any other application use this method.
Are there any "best practises" or standard "patterns" or frameworks to deal with this problem?
At the moment I'm thinking of doing something like this -
Write data to a temp file
Check the data was written correctly with a hash
Rename the original file, and put the temp file in place.
Delete the original
Then if anything fails I can roll back by simply deleting the temp file, and the original is untouched.
You must find the reason why the file gets corrupted. If the app crashes unexpectedly, it can't corrupt the file. The only thing that can happen is that the file is truncated (i.e. the last log messages are missing). But the app can't really jump around in the file and modify something elsewhere (unless you call seek in the logging code, which would surprise me).
My guess is that the app is multi-threaded and the logging code is being called from several threads, which can easily lead to the data being corrupted before it is written to the log.
You probably forgot to call fsync() every so often, or the data comes in from different threads without proper synchronization among them. Hard to tell without more information (platform, form of corruption you see).
A workaround would be to use logfile rollover, ie. starting a new file every so often.
I really think that you (and others) are wasting your time when you start adding complexity to log files. The whole point of a log is that it should be simple to use and implement, and should work most of the time. To that end, just write the log to an unbuffered stream (like cerr in a C++ program) and live with the (in my experience very occasional) snafus.
OTOH, if you really need an audit trail of everything your app does, for legal reasons, then you should be using some form of transactional storage such as a SQL database.
Not sure if your app is multi-threaded -- if so, consider using the Active Object pattern (PDF) to put a queue in front of the log and make all writes from a single thread. That thread can commit the log in the background. All log writes will be asynchronous and in order, but not necessarily written immediately.
The active object can also batch writes.
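A rough sketch of that arrangement - a thread-safe queue in front of the log file, drained by one background writer thread (the class name and details are made up for illustration, not a specific framework):
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLog
{
public:
    explicit AsyncLog(const std::string& path)
        : out_(path, std::ios::app), worker_([this] { run(); }) {}

    ~AsyncLog()
    {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();                    // drains remaining messages, then stops
    }

    void write(std::string msg)            // callable from any thread
    {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(msg)); }
        cv_.notify_one();
    }

private:
    void run()                             // the single writer thread
    {
        std::unique_lock<std::mutex> lk(m_);
        for (;;)
        {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            while (!q_.empty())
            {
                std::string msg = std::move(q_.front());
                q_.pop();
                lk.unlock();               // don't hold the lock while doing I/O
                out_ << msg << '\n';
                lk.lock();
            }
            if (done_) return;
        }
    }

    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool done_ = false;
    std::thread worker_;
};
Callers construct one AsyncLog and call write() from any thread; ordering is preserved because everything goes through the single queue and writer thread.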