I'm making a File class that uses fstream to read/write a file. I have no issues with functionality; my question is about best practice regarding the lifetime of the fstream object.
Is it better to have an fstream object stored as a member variable that gets created for each new File(path), and use that fstream over the lifetime of each File instance?
Or, for each individual function that I can call on a File instance (readBytes(), writeBytes(), exists(), isDirectory(), etc.), should I declare a local ifstream/ofstream, do what needs to be done, and let them go out of scope (and be auto-closed) when the function exits?
In the first case, I fear that if I have many many files "open" there will be a penalty for having that many streams active at the same time.
In the second case, it just seems inefficient to continually create and destroy fstream objects.
Comments from anyone with experience in the matter would be greatly appreciated!
Thanks,
Jon.
You have nailed the two issues right on the head. Generally the most efficient approach is to keep the files around (open) until you run the risk of running out of file descriptors. On some systems file descriptors aren't recycled immediately, so you need to limit your use of descriptors by closing some files before you would run out.
If you know something more about which files are read/written more often, which are only read/written in large chunks, etc. you could close down those for which the penalty of having to open them again is relatively small.
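To make the trade-off concrete, here is a minimal sketch of the "member stream, opened lazily" approach, assuming a hypothetical File class like the one described (readBytes() is simplified to the read path only, and release() is an invented helper for handing the descriptor back when you approach the limit):

#include <fstream>
#include <string>
#include <vector>

class File {
public:
    explicit File(std::string path) : path_(std::move(path)) {}

    std::vector<char> readBytes(std::streamoff pos, std::size_t count) {
        ensureOpen();
        std::vector<char> buf(count);
        stream_.clear();                 // reset any eof/fail state from a previous read
        stream_.seekg(pos);
        stream_.read(buf.data(), static_cast<std::streamsize>(count));
        buf.resize(static_cast<std::size_t>(stream_.gcount()));   // keep only what was read
        return buf;
    }

    void release() {                     // call when you risk exhausting descriptors
        if (stream_.is_open()) stream_.close();
    }

private:
    void ensureOpen() {                  // reopen transparently after release()
        if (!stream_.is_open())
            stream_.open(path_, std::ios::in | std::ios::binary);
    }

    std::string path_;
    std::fstream stream_;
};

The write path and error handling are omitted; the point is only that closing a rarely-used stream is cheap, because the next access simply reopens it.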
Related
I have multiple threads, and I want each of them to process a part of my file. Can I have a single ifstream object for that and make them concurrently read different parts? The parts are non-overlapping, so the same line will not be processed by two threads. If yes, how do I get multiple cursors?
A single std::ifstream is associated with exactly one cursor (there's a seekg and tellg method associated with the std::ifstream directly).
If you want the same std::ifstream object to be shared across multiple threads, you'll have to have some sort of synchronization mechanism between the threads, which might defeat the purpose (in each thread, you'll have to lock, seek, read and unlock each time).
To solve your problem, you can open one std::ifstream to the same file per thread. In each thread, you'd seek to whatever position you want to start reading from. This would only require you to be able to "easily" compute the seek position for each thread though (Note: this is a pretty strong requirement).
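As a rough sketch of that approach (the chunk split here is just equal byte ranges, which assumes the parts really are independently computable, e.g. fixed-size records rather than arbitrary lines):

#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Each worker opens its own std::ifstream on the same file and seeks to its
// own chunk, so no synchronization of a shared cursor is needed.
void processChunk(const std::string& path, std::streamoff begin, std::streamoff end) {
    std::ifstream in(path, std::ios::binary);   // private stream, private cursor
    in.seekg(begin);
    std::vector<char> buf(static_cast<std::size_t>(end - begin));
    in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    // ... process buf ...
}

void processFileInParallel(const std::string& path, std::streamoff fileSize, unsigned nThreads) {
    std::vector<std::thread> workers;
    std::streamoff chunk = fileSize / nThreads;
    for (unsigned i = 0; i < nThreads; ++i) {
        std::streamoff begin = i * chunk;
        std::streamoff end   = (i + 1 == nThreads) ? fileSize : begin + chunk;
        workers.emplace_back(processChunk, path, begin, end);
    }
    for (auto& t : workers) t.join();
}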
C++ file streams are not guaranteed to be thread safe (see e.g. this answer).
The typical solution, in any case, is to open separate streams on the same file; each instance comes with its own "cursor". However, you need to ensure shared access, and concurrency becomes platform-specific.
For ifstream (i.e. only reading from the file), the concurrency issues are usually tame. Even if someone else modifies the file, both streams might see different content, but you do have some kind of eventual consistency.
Reads and writes are usually not atomic, i.e. you might read only part of a write. Writes might not even execute in the order they are issued (see write combining).
Looking at the FILE struct, it seems there is a pointer inside FILE, char* curp, holding the current position, which may mean that each FILE object tracks its own particular part of the file.
This being C, I don't know how ifstream works or whether it uses a FILE object or is built like one. It might not help you at all, but I thought this little bit of information was interesting to share and could maybe help someone.
Recently I ran into a problem at work where you have two functions; one opens a file descriptor (a local variable in that function) and passes it to another function, where it is used for reading or writing. Now, when one of the read/write operations fails, the function doing the read/write closes this file descriptor and returns.
The question is: whose responsibility is it to close the file descriptor, or let's say do the cleanup:
the function which created the fd
the function which experienced the error while read/write
Is there a design rule for these kinds of cases, i.e. creation and cleanup?
BTW, the problem was that both functions attempted to close the fd, which resulted in a crash on the second call to close.
There are two parts to this answer — the general design issue and the detailed mechanics for your situation.
General Design
Handling resources such as file descriptors correctly, and making sure they are released correctly, is an important design issue. There are multiple ways to manage the problem that work. There are some others that don't.
Your tags use C and C++; be aware that C++ has extra mechanisms available to it.
In C++, the RAII — Resource Acquisition Is Initialization — idiom is a great help. When you acquire a resource, you ensure that whatever acquires the resource initializes a value that will be properly destructed and will release the resource when destructed.
In both languages, it is generally best if the function responsible for allocating the resource also releases it. If a function opens a file, it should close it. If a function is given an open file, it should not close the file.
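To illustrate both points (RAII in C++, and "whoever opens it, closes it"), here is a minimal sketch assuming POSIX open()/close(); FdGuard and readSome() are invented names:

#include <fcntl.h>
#include <unistd.h>

// The object that acquires the descriptor is the one that releases it, so
// callers that are merely handed the fd never need to (and must not) close it.
class FdGuard {
public:
    explicit FdGuard(const char* path, int flags) : fd_(::open(path, flags)) {}
    ~FdGuard() { if (fd_ >= 0) ::close(fd_); }          // closed exactly once, here

    FdGuard(const FdGuard&) = delete;                   // no copies, so no double close
    FdGuard& operator=(const FdGuard&) = delete;

    int get() const { return fd_; }                     // borrow, do not close

private:
    int fd_;
};

// A helper that reads from a borrowed descriptor reports errors but never
// closes the fd; cleanup stays with FdGuard's destructor in the caller.
bool readSome(int fd, char* buf, size_t len) {
    return ::read(fd, buf, len) >= 0;
}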
In the comments, I wrote:
Generally, the function that opened the file should close it; the function that experienced the error should report the error, but not close the file. However, you can work it how you like as long as the contract is documented and enforced — the calling code needs to know when the called code closed the file to avoid double closes.
It would generally be a bad design for the called function to close the file sometimes (on error), but not other times (no error). If you must go that way, then it is crucial that the calling function is informed that the file is closed; the called function must return an error indication that tells the calling code that the file is no longer valid and should neither be used nor closed. As long as the information is relayed and handled, there isn't a problem — but the functions are harder to use.
Note that if a function is designed to return an opened resource (it is a function that's responsible for opening a file and making it available to the function that called it), then the responsibility for closing the file falls on the code that calls the opening function. That is a legitimate design; you just have to make sure that there is a function that knows how to close it, and that the calling code does close it.
Similar comments apply to memory allocation. If a function allocates memory, you must know when the memory will be freed, and ensure that it is freed. If it was allocated for the purposes of the current function and functions it calls, then the memory should be released before return. If it was allocated for use by the calling functions, then the responsibility for release transfers to the calling functions.
Detailed mechanics
Are you sure you're using file descriptors and not FILE * (file streams)? It's unlikely that closing a file descriptor twice would cause a crash (error, yes, but not a crash). OTOH, calling fclose() on an already closed file stream could cause problems.
In general, in C, you pass file descriptors, which are small integers, by value, so there isn't a way to tell the calling function that the file descriptor is no longer valid. In C++, you could pass them by reference, though it is not conventional to do so. Similarly with FILE *: they're usually passed by value, not by reference, so there isn't a way to tell the calling code that the file is not usable any more by modifying the value passed to the function.
You can invalidate a file descriptor by setting it to -1; that is never a valid file descriptor. Using 0 is a bad idea; it is equivalent to using standard input. You can invalidate a file stream by setting it to 0 (aka NULL). Passing the null pointer to functions that try to use the file stream will tend to cause crashes. Passing an invalid file descriptor typically won't cause crashes — the calls may fail with EBADF set in errno, but that's the limit of the damage, usually.
Using file descriptors, you will seldom get a crash because the file descriptor is no longer valid. Using file streams, all sorts of things can go wrong if you try using an invalid file stream pointer.
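If you do adopt a contract where a called function may close the resource, a hedged sketch of the invalidate-after-close convention described above might look like this (close_fd() and close_stream() are invented helpers; passing the descriptor or stream by address lets the helper reset the caller's copy):

#include <stdio.h>
#include <unistd.h>

/* Close and invalidate a file descriptor: -1 is never a valid descriptor. */
void close_fd(int *fd)
{
    if (fd != NULL && *fd >= 0)
    {
        close(*fd);
        *fd = -1;
    }
}

/* Close and invalidate a file stream: a null stream is easy to test for. */
void close_stream(FILE **fp)
{
    if (fp != NULL && *fp != NULL)
    {
        fclose(*fp);
        *fp = NULL;
    }
}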
I am trying to learn C++ fstream, ifstream, ofstream. Halfway through my project I learned that if we are accessing the same file with ofstream and ifstream for read & write, it's better to close one stream before using the other.
Like
ofstream write_stream(...);
ifstream read_stream(....);
// Accessing pointers using both read_stream, write_stream interchangeably
read_stream.read(....);
write_stream.write(....);
read_stream.close();
write_stream.close();
///
In the above case, I guess both streams use the same pointer into the file, so I need to be aware of the pointer movement and have to seek every time I try to read() or write().
I guess, I am right, so far.
To avoid any more confusion, I have decided to use this format
fstream read_write_stream("File.bin", ios::in | ios::out | ios::binary);
read_write_stream.seekp(pos);   // position the put (write) pointer
read_write_stream.seekg(pos);   // position the get (read) pointer
read_write_stream.tellp();
read_write_stream.tellg();
read_write_stream.read(...);
read_write_stream.write(...);
read_write_stream.close();
Is there any bug-inducing behavior that I should be aware of in the above program? Please advise.
Though I don't know whether the standard explicitly refers to this case, I don't think a C++ compiler can promise you what will happen when you use several streams to change the same external resource (a file, in this case). In addition to the fstream internal implementation, it depends on the OS and the hardware you're writing to. If I'm correct, that makes these operations, by default, undefined behavior.
If you use two different streams, most likely each will manage its own put and get pointers and its own buffering, which means that if you don't use flush(), you won't be able to determine in what order the operations will be performed on the file.
If you use one stream to both read and write, I think the behavior will be predictable and easier to understand.
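For what it's worth, a minimal sketch of that single-stream pattern, using the calls from the question ("File.bin" is just the placeholder name from the question, and the file is assumed to already exist):

#include <fstream>
#include <iostream>

int main() {
    std::fstream io("File.bin", std::ios::in | std::ios::out | std::ios::binary);
    if (!io) return 1;

    const char out[4] = {'A', 'B', 'C', 'D'};
    io.seekp(0);                    // position the put pointer explicitly
    io.write(out, sizeof out);
    io.flush();                     // make the write visible before reading it back

    char in[4] = {};
    io.seekg(0);                    // a seek is required when switching from writing to reading
    io.read(in, sizeof in);
    std::cout.write(in, io.gcount());
    std::cout << '\n';
}

With a single stream there is one underlying buffer, and you control the ordering explicitly with the seeks and the flush().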
So I read this interview with John Carmack in Gamasutra, in which he talks about what he calls "live C++ objects that live in memory mapped files". Here are some quotes:
JC: Yeah. And I actually get multiple benefits out of it in that... The last iOS Rage project, we shipped with some new technology that's using some clever stuff to make live C++ objects that live in memory mapped files, backed by the flash file system on here, which is how I want to structure all our future work on PCs.
...
My marching orders to myself here are, I want game loads of two seconds on our PC platform, so we can iterate that much faster. And right now, even with solid state drives, you're dominated by all the things that you do at loading times, so it takes this different discipline to be able to say "Everything is going to be decimated and used in relative addresses," so you just say, "Map the file, all my resources are right there, and it's done in 15 milliseconds."
(Full interview can be found here)
Does anybody have any idea what Carmack is talking about and how you would set up something like this? I've searched the web for a bit but I can't seem to find anything on this.
The idea is that all or part of your program state is serialized into a file at all times, by accessing that file via memory mapping. This requires that you not use ordinary pointers, because pointers are only valid while your process lasts. Instead you store offsets from the start of the mapping, so that when you restart the program and remap the file you can continue working with it. The advantage of this scheme is that there is no separate serialization step (and no extra code for one), and you don't need to save all the state at once; instead all (or most) of your program state is backed by the file at all times.
You'd use placement new, either directly or via custom allocators.
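A very rough sketch of the mechanism, assuming POSIX mmap (GameState and mapState() are invented for illustration; Carmack's actual scheme is considerably more elaborate):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <new>

struct GameState {          // must be trivially copyable; no raw pointers inside
    int level;
    float playerX, playerY;
};

GameState* mapState(const char* path) {
    int fd = ::open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    ::ftruncate(fd, sizeof(GameState));                       // ensure the file is big enough
    void* mem = ::mmap(nullptr, sizeof(GameState),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    ::close(fd);                                              // the mapping keeps the file alive
    if (mem == MAP_FAILED) return nullptr;
    // Placement new with default-initialization: the object now lives in the
    // file-backed memory. munmap() is omitted; the mapping lasts until exit.
    return new (mem) GameState;
}

Default-initializing a trivial type with placement new leaves the mapped bytes untouched, so a later run of the program picks up whatever state the previous run left in the file; anything with raw pointers, virtual functions, or non-trivial constructors needs much more care (offsets or relative pointers, as discussed below).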
Look at EASTL for an implementation of (subset) STL that is specifically geared to working well with custom allocation schemes (such as required for games running on embedded systems or game consoles).
A free subset of EASTL is here:
http://gpl.ea.com/
a clone at https://github.com/paulhodge/EASTL
For years we have used something we call "relative pointers", which is a kind of smart pointer. It is inherently nonstandard, but works nicely on most platforms. It is structured like:
template<class T>
class rptr
{
    size_t offset;   // byte distance from this rptr object to the target object
public:
    // Resolve the target relative to the rptr's own address, so the pointer
    // stays valid no matter where the containing memory block is mapped.
    T* operator->() { return reinterpret_cast<T*>(reinterpret_cast<char*>(this)+offset); }
};
This requires that all objects are stored in the same shared memory (which can be a file mapping too). It also usually requires us to store only our own compatible types in there, as well as having to write our own allocators to manage that memory.
To always have consistent data, we use snapshots via COW mmap tricks (which work in userspace on Linux; no idea about other OSs).
With the big move to 64-bit we also sometimes just use fixed mappings, as the relative pointers incur some runtime overhead. With the usual 48 bits of address space, we chose a reserved memory area for our applications that we always map such a file to.
This reminds me of a file system I came up with that loaded level files off CD in an amazingly short time (it improved the load time from tens of seconds to near-instantaneous), and it works on non-CD media as well. It consisted of three versions of a class to wrap the file IO functions, all with the same interface:
class IFile
{
public:
IFile (class FileSystem &owner);
virtual Seek (...);
virtual Read (...);
virtual GetFilePosition ();
};
and an additional class:
class FileSystem
{
public:
BeginStreaming (filename);
EndStreaming ();
IFile *CreateFile (filename);
};
and you'd write the loading code like:
void LoadLevel (levelname)
{
FileSystem fs;
fs.BeginStreaming (levelname);
IFile *file = fs.CreateFile (level_map_name);
ReadLevelMap (fs, file);
delete file;
fs.EndStreaming ();
}
void ReadLevelMap (FileSystem &fs, IFile *file)
{
read some data from fs
get names of other files to load (like textures, object definitions, etc...)
for each texture file
{
IFile *texture_file = fs.CreateFile (some other file name)
CreateTexture (texture_file);
delete texture_file;
}
}
Then, you'd have three modes of operation: debug mode, stream file build mode and release mode.
In each mode, the FileSystem object would create different IFile objects.
In debug mode, the IFile object just wrapped the standard IO functions.
In stream file build mode, the IFile object also wrapped the standard IO, but additionally wrote every byte that was read out to the stream file (which the owning FileSystem had opened), and wrote out the return value of any file-pointer position queries (so if anything needed to know a file size, that information was written to the stream file). This would sort of concatenate the various files into one big file, but only the data that was actually read.
The release mode would create an IFile that did not open files or seek within files, it just read from the streaming file (as opened by the owner FileSystem object).
This means that in release mode, all data is read in one sequential series of reads (the OS would buffer it nicely) rather than lots of seeks and reads. This is ideal for CDs where seek times are really slow. Needless to say, this was developed for a CD based console system.
A side effect is that the data is stripped of unnecessary metadata that would normally be skipped.
It does have drawbacks: all the data for a level is in one file, these files can get quite large, and data can't be shared between them; if you had a set of textures, say, that were common across two or more levels, the data would be duplicated in each stream file. Also, the load process must be the same every time the data is loaded; you can't conditionally skip or add elements to a level.
As Carmack indicates, many games' (and other applications') loading code is structured as a lot of small reads and allocations.
Instead of doing this, you do a single fread (or equivalent) of, say, a level file into memory and just fix up the pointers afterwards.
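A toy sketch of that idea, with a made-up level format (LevelHeader, Texture and the offset field are inventions just to show the fixup step):

#include <cstdint>
#include <cstdio>
#include <vector>

struct Texture { char name[32]; /* pixel data lives elsewhere in the blob */ };

struct LevelHeader {
    std::uint32_t textureCount;
    std::uint32_t textureOffset;   // byte offset of the Texture array within the blob
};

std::vector<unsigned char> loadBlob(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return {};
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<unsigned char> blob(static_cast<std::size_t>(size));
    std::fread(blob.data(), 1, blob.size(), f);   // the single large read
    std::fclose(f);
    return blob;
}

Texture* fixupTextures(std::vector<unsigned char>& blob) {
    auto* hdr = reinterpret_cast<LevelHeader*>(blob.data());
    // "Fixup": turn the stored offset into a real pointer into the blob.
    return reinterpret_cast<Texture*>(blob.data() + hdr->textureOffset);
}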
I'm working with two independent C/C++ applications on Windows, where one of them constantly updates an image on disk (from a webcam) and the other reads that image for processing. This works fine and dandy 99.99% of the time, but every once in a while the reader app is in the middle of reading the image when the writer deletes it to refresh it with a new one.
The obvious solution to me seems to be to have the reader put some sort of lock on the file so that the writer can see that it can't delete it, and thus spin until it can delete and update. Is there any way to do this? Or is there another simple design pattern I can use to get the same sort of constant image refreshing between the two programs?
Thanks,
-Robert
Try using a synchronization object; a mutex will probably do. Whenever a process wants to read or write the file, it should first acquire the mutex lock.
Yes, a locking mechanism would help. There are, unfortunately, several to choose from. Linux/Unix, for example, has flock(2); Windows has a similar (but different) mechanism.
Another (somewhat hacky) solution is to just write the file under a temporary name, then rename it. Many filesystems guarantee that a rename is atomic, so this may work. This however depends on the fs, so it's a bit hacky.
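A sketch of that rename trick using C++17 std::filesystem (the file names are placeholders; whether the final rename is truly atomic depends on the filesystem, as noted):

#include <filesystem>
#include <fstream>

// The writer never touches the published name directly, so the reader either
// sees the old image or the new one, never a half-written file.
void publishImage(const std::filesystem::path& published,
                  const char* data, std::size_t size) {
    std::filesystem::path temp = published;
    temp += ".tmp";                                   // e.g. image.jpg.tmp

    {
        std::ofstream out(temp, std::ios::binary | std::ios::trunc);
        out.write(data, static_cast<std::streamsize>(size));
    }                                                 // stream closed (and flushed) here

    // Replace the published file in one step; on most filesystems this is atomic.
    std::filesystem::rename(temp, published);
}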
If you are willing to go with the Windows API, opening the file with CreateFile and passing in 0 for the dwShareMode will not allow any other application to open the file.
From the documentation:
Prevents other processes from opening a file or device if they
request delete, read, or write access.
Then you'd have to use ReadFile, WriteFile, CloseHandle, etc. rather than the C standard library functions.
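A hedged sketch of that, with a placeholder file name:

#include <windows.h>

// dwShareMode == 0 means no other process can open "image.bmp" until the
// handle is closed, so the writer's delete/replace will fail while we read.
bool readImageExclusively(char* buffer, DWORD bufferSize, DWORD* bytesRead) {
    HANDLE h = CreateFileA("image.bmp",
                           GENERIC_READ,
                           0,                      // dwShareMode: no sharing at all
                           NULL,                   // default security
                           OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL,
                           NULL);
    if (h == INVALID_HANDLE_VALUE)
        return false;                              // e.g. the writer currently has it open

    BOOL ok = ReadFile(h, buffer, bufferSize, bytesRead, NULL);
    CloseHandle(h);
    return ok != FALSE;
}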
Or, as a really simple kludge, the reader creates a temp file (say, .lock) before it starts reading and deletes it afterwards. The writer doesn't manipulate the file so long as .lock exists.
That's how Open Office does it (and others) and it's probably the simplest to implement, no matter which platform.
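A sketch of the kludge (file names invented; note it is purely advisory, both programs must honour the marker, and a crashed reader leaves a stale .lock behind):

#include <chrono>
#include <filesystem>
#include <fstream>
#include <thread>

namespace fs = std::filesystem;

// Reader side: drop a marker file while reading, remove it when done.
void readerReads() {
    { std::ofstream marker("image.jpg.lock"); }            // create the marker file
    // ... read image.jpg ...
    fs::remove("image.jpg.lock");                           // release the marker
}

// Writer side: refuse to replace the image while the marker exists.
void writerUpdates() {
    while (fs::exists("image.jpg.lock"))
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    // ... delete / rewrite image.jpg ...
}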
Many solutions have been proposed; I commented on some of them, but I'd like to chime in with an overall view, some specifics, and recommendations:
You have the following options:
use filesystem locking: under Windows have both the reader and writer open (and create with the CREATE_ALWAYS disposition, respectively) the shared file in OF_SHARE_EXCLUSIVE mode; have both the reader and writer ready to handle ERROR_SHARING_VIOLATION and retry after some predefined period of time (e.g. 250ms)
use file renaming to essentially transfer file ownership: have the writer create a writer-private file (e.g. shared_file.tmpwrite), write to it, close it, then make it publicly available to the reader by renaming it to an agreed-upon "public" name (e.g. simply shared-file); have the reader periodically test for the existence of a file with the agreed-upon "public" name (e.g. shared-file) and, when one is found, attempt to first rename it to a reader-private name (e.g. shared_file.tmpread) before having the reader open it (under the reader-private name); under Windows use MOVEFILE_REPLACE_EXISTING; the rename operation does not have to be atomic for this to work
use other forms of interprocess communication (IPC): under Windows you can create a named mutex, and have both the reader and writer attempt to create (the existing mutex will be returned if it already exists) and then acquire the named mutex before opening the shared file for reading or writing (a minimal sketch of this follows at the end of this answer)
implement your own filesystem-backed locking: take advantage of open(O_CREAT|O_EXCL) or, under Windows, of the CREATE_NEW disposition to atomically create an application lock file; unlike OF_SHARE_EXCLUSIVE approach above, it would be up to you to deal with stale lock files (i.e. lock files left by a process which did not shut down gracefully such as after a crash.)
I would implement method 1.
Method 2 would also work, but it is in a sense reinventing the wheel.
Method 3 arguably has the advantage of allowing your reader process to wait on the writer process and vice versa, eliminating the need for the arbitrary sleep delays between the retries of methods 1 and 2 (polling); however, if you are OK with polling then you should still use method 1.
Method 4 is listed for completeness only, as it is complex to implement: when the lock file is detected to be stale (e.g. by checking whether the PID contained therein still exists), multiple processes can potentially compete for its removal, which introduces a race condition requiring a second lock, which in turn can become stale, and so on. For example:
process A creates the lock file but dies without removing the lock file
process A restarts and tries to acquire the lock file but realizes it is stale
process B comes out of a sleep delay and also tries to acquire the lock file but realizes it is stale
process A removes the lock file, which it knew to be stale, and recreates it essentially reacquiring the lock
process B removes the lock file, which it (still) thinks is stale (although at this point it is no longer stale and owned by process A) -- violation
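For completeness, here is a minimal sketch of the named-mutex approach from method 3 (the mutex name is a placeholder and must be the same in both processes):

#include <windows.h>

// Both the reader and the writer create/open the same named mutex and hold
// it only while touching the shared image file.
void withImageLock(void (*accessFile)(void)) {
    HANDLE mutex = CreateMutexA(NULL, FALSE, "SharedImageMutex");
    if (mutex == NULL)
        return;

    if (WaitForSingleObject(mutex, INFINITE) == WAIT_OBJECT_0) {
        accessFile();              // read or rewrite the image while holding the lock
        ReleaseMutex(mutex);
    }
    CloseHandle(mutex);
}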
Instead of deleting images, what about appending them to the end of the file? This would allow you to keep adding to the file while the reader is still operating, without destroying the file. The reader can then delete the image when it's done with it (provided that is necessary) and move on to the next image. Or, the other option would be to store the image in a buffer for writing, and test the file pointer: if it's set to the head of the file, then you can go ahead and write from the buffer to the file; otherwise, wait until the reader finishes and puts the pointer back at the head of the file.
Couldn't you store a few images? ('n' sounds like a good number :-)
Not so many that you fill your disk, but surely 3 would be enough? If not, you are writing faster than you can process and have a fundamental problem anyhoo (tune to discover 'n').
Cyclically overwrite.
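A tiny sketch of the cyclic-overwrite idea on the writer's side (the name pattern and n = 3 are arbitrary):

#include <string>

// The writer rotates through n image slots, so the file the reader is
// currently on is never the one being rewritten.
std::string nextImagePath(unsigned& slot, unsigned n = 3) {
    std::string path = "image_" + std::to_string(slot) + ".jpg";
    slot = (slot + 1) % n;      // advance to the next slot for the following frame
    return path;
}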