Design: Large archive file editor, file mapping - C++

I'm writing an editor for large archive files (see below) of 4 GB+, in native & managed C++.
For accessing the files, I'm using file mapping (see below) like any sane person. This is absolutely great for reading data, but a problem arises in actually editing the archive.
File mapping does not allow resizing a file while it's being accessed, so I don't know how I should proceed when the user wants to insert new data into the file (which would exceed the file's original size at the time it was mapped).
Should I remap the whole thing every time? That's bound to be slow. However, I'd want to keep the editor real-time with exclusive file access, since that simplifies the programming a lot and keeps the file from being corrupted by other applications while it's being modified. I wouldn't want to spend an eternity working on the editor; it's just a simple dev tool for the actual project I'm working on.
So I'd like to hear how you've handled similar cases, and what other archiving software, and especially other games, do to solve this.
To clarify:
This is not a text file; I'm writing a specific binary archive file format. By which I mean a big file that contains many others, in directories. Custom archive files are very common in games for a number of reasons. With my format, I'm aiming at a similar (but somewhat simpler) structure to Valve Software's GCF format. I would have used the GCF format as it is, but unfortunately no editor exists for the format, although there are many great implementations for reading them, like HLLib.
Accessing the file must be fast, as it is intended for storing game resources. So it's not a database; database files would be contained inside it, along with GFX, SFX etc. files.
"File mapping" as discussed here is a specific technique on the Windows platform which allows direct access to a large file through creating "views" into parts of it, see here: http://msdn.microsoft.com/en-us/library/aa366556(VS.85).aspx - This technique has minimal latency and memory usage and is a no-brainer for accessing any large file.
So this does not mean reading the whole 4GB file into memory, it's exactly the contrary.
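To illustrate the "views" part for anyone unfamiliar with the API, here is a minimal sketch of mapping just a 64 KB window of a large file (the function name and window size are my own; error handling omitted). View offsets must be multiples of the system's allocation granularity, hence the align-down step.

#include <windows.h>

// Map a read-only 64 KB window at (roughly) the given offset of a large file.
const void* MapWindow(HANDLE mapping, ULONGLONG offset)
{
    SYSTEM_INFO si;
    ::GetSystemInfo(&si);
    offset -= offset % si.dwAllocationGranularity;  // align down (64 KB on typical systems)
    return ::MapViewOfFile(mapping, FILE_MAP_READ,
                           (DWORD)(offset >> 32),        // offset, high part
                           (DWORD)(offset & 0xFFFFFFFF), // offset, low part
                           64 * 1024);                   // bytes to map
}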

What do you mean by 'editor software'? If this is a text file, have you tried existing production-quality editors, before writing your own? If it's a file storing binary data, have you considered using an RDBMS and manipulating its contents using SQL statements?
If you absolutely have to write this from scratch, I'm not sure that mmapping is the way to go. Mmapping a huge file will put a lot of pressure on your machine's VM system, and unless there are many editing operations all over the file, its efficiency may lag behind a simple read/write scheme. Worse, as you say, you have problems when you want to extend the file.
Instead, maintain buffer windows onto the file's data, which the user can modify. When the user decides to save the file, traverse the file and the edited buffers sequentially to create the new file image. If you have the disk space, it's easier to write a new file (especially if a buffer's size has changed); otherwise you need to be clever about how you read ahead existing data before you overwrite it with the new contents.
Alternatively, you can keep a journal of editing operations. When the user decides to save the file, perform a topological sort on the journal and play it on the existing file to create the new one.
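As a sketch of what such a journal could look like (the names below are illustrative, not from any particular editor): each operation records where it applies in original-file coordinates, and saving replays them in offset order against a sequential copy of the old file.

#include <cstdint>
#include <vector>

// One recorded editing operation, in original-file coordinates.
struct EditOp {
    enum Kind { Insert, Delete, Overwrite } kind;
    uint64_t offset;            // position in the original file
    uint64_t removed;           // bytes removed (Delete / Overwrite)
    std::vector<uint8_t> data;  // bytes added (Insert / Overwrite)
};

// On save: sort the journal by offset, then stream-copy the original
// file into the new image, applying each op as its offset is reached.
std::vector<EditOp> journal;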
For exclusive file access use the file locking of your operating system or implement application-level locking (if only your editor will touch these files). Depending on mmap for exclusive access constrains your implementation choices.

Mapping the file is great for actually accessing the data, but I think you need another abstraction that represents the structure of the file. There are various ways of doing this, but consider representing the file as a sequence of 'extents'.
To start with the file is a single extent that is equivalent to the whole mapping. If the user then starts to edit the file, you would split the single extent into two at the edit point, and insert a new extent that contains the data the user has inserted. Modifications and deletes would also modify your view of the file by creating or modifying these extents.
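A rough sketch of that extent representation (essentially a piece table; the names are mine, and bookkeeping such as merging adjacent extents is omitted):

#include <cstdint>
#include <list>
#include <vector>

struct Extent {
    bool from_file;   // true: bytes live in the mapped file; false: in `added`
    uint64_t offset;  // offset into the file, or into `added`
    uint64_t length;
};

std::list<Extent> extents;   // starts as one extent covering the whole file
std::vector<uint8_t> added;  // data the user has inserted so far

// To insert at position p: walk the list to the extent containing p,
// split it in two at p, and splice a new from_file == false extent
// between the halves. Deletes shrink or remove extents the same way.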
Maybe you could examine the source code for one of the open source editors -- there are lots to choose from, but finding one that is simple enough would be the challenge.

What I do is close the view handle(s) and the file-mapping handle, set the file size, then reopen the mapping/view handles.
#include <windows.h>

// Open the file and map all of it (error checking omitted)
HANDLE FileHandle = ::CreateFileW(file_name, GENERIC_READ | GENERIC_WRITE, 0, NULL, OPEN_EXISTING, 0, NULL);
DWORD Size = ::GetFileSize(FileHandle, NULL);
HANDLE MappingHandle = ::CreateFileMapping(FileHandle, NULL, PAGE_READWRITE, 0, Size, NULL);
void* ViewHandle = ::MapViewOfFile(MappingHandle, FILE_MAP_ALL_ACCESS, 0, 0, Size);
...
// Increase the size of the file: unmap, extend the file, remap
::UnmapViewOfFile(ViewHandle);
::CloseHandle(MappingHandle);
Size += 1024;
LARGE_INTEGER offset;
offset.QuadPart = Size;
LARGE_INTEGER newpos;
::SetFilePointerEx(FileHandle, offset, &newpos, FILE_BEGIN);
::SetEndOfFile(FileHandle);  // the file now ends at the new position
MappingHandle = ::CreateFileMapping(FileHandle, NULL, PAGE_READWRITE, 0, Size, NULL);
ViewHandle = ::MapViewOfFile(MappingHandle, FILE_MAP_ALL_ACCESS, 0, 0, Size);
The above code has no error checking and does not handle 64-bit sizes, but that's not hard to fix.
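For completeness, a hedged sketch of the 64-bit-safe version (my own helper, not from the answer above; GetFileSizeEx would replace GetFileSize in the same spirit). Note that a 32-bit process cannot map a single 4 GB+ view, so you would still combine this with smaller windows as described earlier.

bool RemapFile(HANDLE file, HANDLE& mapping, void*& view, ULONGLONG newSize)
{
    ::UnmapViewOfFile(view);
    ::CloseHandle(mapping);

    LARGE_INTEGER size;
    size.QuadPart = (LONGLONG)newSize;
    if (!::SetFilePointerEx(file, size, NULL, FILE_BEGIN) || !::SetEndOfFile(file))
        return false;

    // the 64-bit maximum size is passed as two 32-bit halves
    mapping = ::CreateFileMapping(file, NULL, PAGE_READWRITE,
                                  (DWORD)(newSize >> 32), (DWORD)(newSize & 0xFFFFFFFF), NULL);
    if (!mapping)
        return false;
    view = ::MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);  // 0 = map the whole thing
    return view != NULL;
}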

There's no easy answer for this problem -- I've looked for one for a long time, in vain. You'll have to modify the file's size, then re-map it.

Mapping has a basic issue with files on remote systems.
In the good old DOS days, there was a fine editor called Norton Editor (ne.com - that's the file name, not a web site). It could load files of any size (we are talking 640 KB of RAM and 20 MB hard disks, if any).
It loaded only part of the file at a time, cleverly managing file-wide searches with on-demand loading.
IMHO, such an approach should be used here.
If properly hidden under a file-read-write layer, it can be surprisingly transparent.
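A sketch of such a layer, under the assumption that callers only ever ask for a byte at an absolute offset (the class and its sizes are illustrative; a real one would handle errors and files over 2 GB with _fseeki64 or equivalent):

#include <cstdio>
#include <cstdint>
#include <vector>

// On-demand window over a large file: callers index by absolute offset,
// and the buffer is refilled from disk only when they leave the window.
class FileWindow {
public:
    explicit FileWindow(std::FILE* f, size_t window = 64 * 1024)
        : f_(f), buf_(window), start_(0), valid_(0) {}

    char at(uint64_t pos) {
        if (pos < start_ || pos >= start_ + valid_)
            refill(pos);
        return buf_[static_cast<size_t>(pos - start_)];
    }

private:
    void refill(uint64_t pos) {
        start_ = pos - pos % buf_.size();  // align to the window size
        std::fseek(f_, static_cast<long>(start_), SEEK_SET);
        valid_ = std::fread(buf_.data(), 1, buf_.size(), f_);
    }

    std::FILE* f_;
    std::vector<char> buf_;
    uint64_t start_;
    size_t valid_;
};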

I'd build the large file from pieces at build-time. You have your editor deal with normal, flat files, in the usual file system (with subdirectories, etc., as appropriate). You then have a compile step that gathers all of these pieces together into your archive file format.

Related

Is it more efficient to rewind a file than to close it and open it up again?

I'm writing a little C++ program for myself. At the beginning, I read a file all the way to the bottom, and later on, right before the program ends, I need to read that file again from the beginning.
My question is: is it more efficient to keep the file open during execution (even though I won't be using it) and just rewind it when I need it again, or should I close it the first time and then open it again when I need it?
Edit: Just to clarify, my question is not only related to the specific project I'm working on. It is really small (less than 300 lines of code), so there won't be any noticeable performance difference. I'm asking about opening, closing and "rewinding" files in general, so it's applicable to other big projects where performance and memory may actually matter.
If you close and reopen the file, the OS definitely needs to update its lock bookkeeping for the file and the list of resources (open files) of your process. Furthermore, close and open are two system calls (kernel calls), and system calls are not cheap; every system call requires a switch into the kernel and back.
Closing the file can (if there are any changes) force the cache to be written to the hard disk, which means a seek time of about 15 ms (a physical move of the disk head). It can be even worse in the case of a network drive.
After closing the file, some properties need to be updated, and a file-system watcher may be triggered.
An antivirus scan may also be triggered after closing the file, depending on the file name, path and antivirus brand.
Furthermore, closing the file is a risk: you may not be able to open it again because another process has it. For example, Dropbox reads every file in the Dropbox folder after a change, so closing and reopening a file does not generally work in a Dropbox folder (Dropbox may be faster, though). And who knows how users will use your application; users are inventive, and they share files in ways you didn't think of.
You might be able to measure a small efficiency gain in the range of a few nanoseconds if you fseek to the beginning of the file, but I don't think this is worth it when you are only dealing with a single file.
Like others said: try to find other areas of code which you can optimize.
As with all performance issues, the final optimizations vary widely. Measure both implementations against a reasonable data set and take it from there.
As a design choice, it may be simpler to cache the contents of the file in memory once it has been read the first time; then there is no need to re-read the contents. If the modified content is required, then again, cache the modified data to forgo the second read.
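For reference, the rewind itself is two calls with iostreams; a minimal sketch (the file name is a placeholder):

#include <fstream>
#include <string>

int main() {
    std::ifstream in("data.txt");  // placeholder file name
    std::string line;
    while (std::getline(in, line)) { /* first pass */ }

    in.clear();                    // must clear the EOF flag before seeking
    in.seekg(0, std::ios::beg);    // rewind instead of close + reopen
    while (std::getline(in, line)) { /* second pass */ }
}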

Overwriting a file without the risk of a corrupt file

So often my applications want to save files to load again later. Having recently got unlucky with a crash, I want to write the operation in such a way that I am guaranteed to have either the new data or the original data, but not a corrupted mess.
My first idea was to do something along the lines of (to save a file called example.dat):
1. Come up with a unique file name for the target directory, e.g. example.dat.tmp.
2. Create that file and write my data to it.
3. Delete the original file (example.dat).
4. Rename ("Move") the temp file to where the original was (example.dat.tmp -> example.dat).
Then at load time the application can follow these rules:
If no "example.dat" and no "example.dat.tmp", it's a first run / new project, so load the defaults / create a new file.
If "example.dat" and no "example.dat.tmp", load example.dat (the normal load case).
If "example.dat.tmp" exists, offer the user the chance to potentially recover data. If "example.dat" also exists, do not overwrite it without explicit user consent.
However, having done a little research, I found that as well as OS caching, which I may be able to override with the file flush methods, some disk drives still cache internally and may even lie to the OS, saying they are done. So step 4 could complete while the write has not actually been written, and if the system goes down, I have lost my data...
I am not sure the disk problem is actually solvable by an application, but are the general rules above the correct thing to do? Should I keep an old recovery copy of the file for longer to be sure? What are the guidelines regarding such things (e.g. acceptable disk usage, whether the user should choose, where to put such files, etc.)?
Also, how should I avoid potential conflicts between the user and other programs over "example.dat.tmp"? I recall sometimes seeing a "~example.dat" from some other software; is that a better convention?
If the disk drives report back to the OS that the data is physically on the disk, and it's not, then there's not much you can do about it. A lot of disks do cache a certain number of writes and report them done, but such disks should have a battery backup and finish the physical writes no matter what (and they won't lose data in case of a system crash, since they won't even see it).
For the rest, you say you've done some research, so you no doubt know that you can't use std::ofstream (nor FILE*) for this; you have to do the actual writes at the system level, and open the files with special attributes to ensure full synchronization. Otherwise, the operations can stick around in the OS buffering for a while. And as far as I know, there's no way of ensuring such synchronization for a rename.
But I'm not sure that it's necessary if you always keep two versions. My usual convention in such cases is to write to a file "example.dat.new"; then, when I'm done writing, delete any file named "example.dat.bak", rename "example.dat" to "example.dat.bak", and then rename "example.dat.new" to "example.dat". Given this, you should be able to figure out what did or did not happen, and find the correct file (interactively, if need be, or by inserting an initial line with a timestamp).
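That convention translates fairly directly into code. A sketch using std::filesystem (C++17), with the caveat from above that nothing here forces the drive's own cache to disk, and that the renames are only atomic on some platforms:

#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Save `contents` to `target` following the .new/.bak convention above.
void save(const fs::path& target, const std::string& contents)
{
    fs::path tmp = target; tmp += ".new";
    fs::path bak = target; bak += ".bak";

    {   // 1. write the new version completely before touching the original
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out << contents;
    }   // closed (and stdio-flushed) here
    fs::remove(bak);              // 2. drop any old backup
    if (fs::exists(target))
        fs::rename(target, bak);  // 3. the current file becomes the backup
    fs::rename(tmp, target);      // 4. the new file takes the original's place
}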
You should lock the actual data file while you write its substitute, if there's a chance that a different process could be going through the same protocol that you are describing.
You can use flock for the file lock.
As for your temp file name, you could make your process ID part of it, for instance "example.dat.3124". No other simultaneously running process would generate the same name.
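On POSIX systems, those two suggestions together might look like the following sketch (error handling omitted; flock is advisory, so it only guards against other processes that also take the lock):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>
#include <string>

void locked_replace()  // illustrative
{
    int fd = open("example.dat", O_RDWR);
    flock(fd, LOCK_EX);  // blocks until we hold the exclusive lock

    // PID-suffixed temp name, e.g. "example.dat.3124"
    std::string tmp = "example.dat." + std::to_string(getpid());
    // ... write tmp in full, then rename it over example.dat ...

    flock(fd, LOCK_UN);
    close(fd);
}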

How to calculate the time to load a file into an application through C++?

I have written code in C++ to open a file in its default application, like a .doc file in MS Word. Now I want to calculate the time it takes to open a file in its application.
For that I need to know the percentage of the file loaded into that application, but over the last 7 days I couldn't find any suitable solution. So can anyone help me solve this problem?
If I am using Windows, can the Windows Task Manager help me do this?
What you're trying to do is not only impossible, it doesn't even make sense.
When you play an MP3 in WMP, it doesn't load the whole file into memory. Instead, it maps a little bit of the file at a time into memory so it can decode the MP3 on the fly as it's playing. (I suppose if you play the song all the way through, without stopping or skipping or fast forwarding or rewinding, it will eventually read every byte of the file, probably finishing a few seconds before the song is over, but I doubt that's what you're looking for.)
Likewise, Word doesn't read an entire .doc file into memory (unless it's very small). That's how it's able to edit gigantic files without using huge amounts of memory. (Again, if you page through the whole file, it will probably eventually read every byte; for that matter, it may eventually copy enough of the file into an autosave backup file that it no longer needs to look at the original. But again, I doubt that's what you're looking for.)
If you only care about certain specific applications, and those applications have a COM Automation interface (as both WMP and Word do), they may have methods or events that will tell you when they're done "loading" a file (meaning they've read enough of it to start playing/displaying/etc.), or when they've "finished" with a file (meaning moved on to the next track, or whatever), but there's no generic answer to that; different applications will have different Automation interfaces. (And, as a side note, you really don't want to do COM Automation from C++ unless you really have to; it's much easier from jscript, vbscript, or your favorite .NET language…)
If the third party process does not signal that it has loaded something, e.g., through some output stream, one way will be to view the file handles being opened and closed by the processes. I presume this will be similar to how "task managers" like Process Explorer are able to view file handles of processes. However, if the process does not close the file handle once it is done "loading", then, you will not get an accurate time. Furthermore, you won't be able to get a "live" percentage of how much data has been loaded.

How to create a virtual file system whose file paths can be accessed the same as on disk

I need to create a file-system type of thing, in memory or on disk, which can be accessed the same as a file on disk, i.e. whose paths can be used in functions like fopen(), etc.
Details:
I am using the AddFontResourceEx function to load a font in my application. Since this function requires a file path, the file needs to be present on disk. But I have a requirement that the user cannot access/see the font file.
I tried the AddFontMemResourceEx function, but the loaded font is not enumerable, so the user cannot see the font in the application. I also tried some libraries which create a VFS, but they work like a database, i.e. you can create files/directories and access them, but you cannot use their file paths in AddFontResourceEx or any other function.
Is there some way I can create a virtual file system, in memory or on disk, which is accessible through my application, where I can write/read files, and whose file paths can be used by the AddFontResourceEx function?
It can't really work. Yes, you can add a "virtual" file system, but either it's visible to user X or it isn't. Access control on Windows works on a per-user basis, not a per-program basis. So, if user X can see the font in application A, he can also see it in application B - even if B is Explorer.EXE.
If the user is an administrator, you can't really prevent them from seeing the font file if they're determined enough. They could, for example, reverse engineer your program to figure out how you're generating the file and repeat the process by hand to make their own copy. Or (even if you could somehow tie the file permissions to your process) they could insert their own code into your process to retrieve the file, or to retrieve the font information directly from memory.
If it's good enough to make it difficult for them to see the font file, you could try this:
Create a directory in the temp folder, with write-only permission for the current user and no permissions for anyone else.
Create a sub-directory with a long, complex, cryptographically random name, and with full permission for the current user. (The name should be different each time.)
Write the font file to the sub-directory and load it.
Delete the font file and remove both directories.
The entire process should take only a fraction of a second, which should make it somewhat difficult for the user to override the permissions and retrieve the file. If they use a debugger to single-step through the program then I guess you're out of luck, but as I already pointed out, nothing's going to stop everyone.
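A sketch of steps 2-4 under the assumptions above (the function, the random-name scheme and the file name are illustrative; the ACL setup from step 1 and all error handling are omitted, and a single 64-bit random number is not truly "cryptographically random"):

#include <windows.h>
#include <random>
#include <string>

void load_private_font(const std::wstring& tempDir, const void* data, DWORD size)
{
    // step 2: sub-directory with a hard-to-guess name
    std::mt19937_64 rng{std::random_device{}()};
    std::wstring dir = tempDir + L"\\" + std::to_wstring(rng());
    ::CreateDirectoryW(dir.c_str(), NULL);

    // step 3: write the font file and load it for this process only
    std::wstring fontPath = dir + L"\\font.ttf";
    HANDLE f = ::CreateFileW(fontPath.c_str(), GENERIC_WRITE, 0, NULL,
                             CREATE_ALWAYS, 0, NULL);
    DWORD written = 0;
    ::WriteFile(f, data, size, &written, NULL);
    ::CloseHandle(f);
    ::AddFontResourceExW(fontPath.c_str(), FR_PRIVATE, 0);

    // step 4: remove the file and directory again
    ::DeleteFileW(fontPath.c_str());
    ::RemoveDirectoryW(dir.c_str());
}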
Another option, presumably, would be to just use AddFontMemResourceEx and put up with the fact that the font isn't then enumerable. You'd just need to change your code so that wherever it enumerates fonts it adds your font(s) to the list manually.
If you didn't get the right answer, maybe you didn't ask the right question. Your post title mentions "virtual filesystem", but later you mention "accessing a font".
"Virtual filesystem" is an ambiguous term used in several ways. One common case means adding devices or networks to an OS. In your case, it seems to mean accessing one from within an application.
There are several ways ("libraries") to emulate or work with a filesystem. Some of them work independently of the real filesystem: you work with them, save data in those "virtual" folders and files, and copy data between the real one and the virtual one. Some of them work as an extension layer between the real filesystem and the program's view of it.
Example: I worked with an application that required temporary fast I/O access. I found a library where creating a folder or saving a file went to the real filesystem as usual, but additionally I could add "virtual drives" that were stored in memory yet accessed with ordinary filesystem operations. When the application finished, those "hard drives" and their data were erased from memory.
It seems that your case is similar to my example. What do you want a "virtual filesystem" library for? I have seen several such libraries for C++ on the web: open source, freeware, and commercial. It depends on what you want to do, to find out which library is the best for your case.
Good luck.

fopen: is it a good idea to leave files open, or to use a buffer?

So I have many log files that I need to write to. They are created when the program begins, and they are saved to file when the program closes.
I was wondering if it is better to do:
fopen() at the start of the program, then close the files when the program ends; I would just write to the files when needed. Will anything (such as other file I/O) be slowed down by these files still being "open"?
OR
Save what needs to be written into a buffer, and then open the file, write from the buffer, and close the file when the program ends. I imagine this would be faster?
Well, fopen(3) + fwrite(3) + fclose(3) is a buffered I/O package, so another layer of buffering on top of it might just slow things down.
In any case, go for a simple and correct program. If it seems to run slowly, profile it, and then optimize based on evidence and not guesses.
Short answer:
A big number of open files shouldn't slow down anything.
Writing to a file will be buffered anyway.
So you can leave those files open, but do not forget to check the limit on open files in your OS.
Part of the point of log files is being able to figure out what happened when/if your program runs into a problem. Quite a few people also do log file analysis in (near) real-time. Your second scenario doesn't work for either of these.
I'd start with the first approach, but with a high-enough level interface that you could switch to the second if you really needed to. I wouldn't view that switch as a major benefit of the high-level interface though -- the real benefit would normally be keeping the rest of the code a bit cleaner.
There is no good reason to buffer log messages in your program and write them out on exit. Simply write them as they're generated using fprintf. The stdio system will take care of the buffering for you. Of course this means opening the file (with fopen) from the beginning and keeping it open.
For log files, you will probably want a functional interface that flushes the data to disk after each complete message, so that if the program crashes (it has been known to happen), the log information is safe. Leaving stuff in standard I/O buffers means excavating the data from a core dump - which is less satisfactory than having the information on disk safely.
Other I/O really won't be affected by holding one - or even a few - log files open. You lose a few file descriptors, perhaps, but that is not often a serious problem. When it is a problem, you use one file descriptor for one log file - and you keep it open so you can log information. You might elect to map stderr to the log file, leaving that as the file descriptor that's in use.
It's been mentioned that the FILE* returned by fopen is already buffered. For logging, you should probably also look into using the setbuf() or setvbuf() functions to change the buffering behavior of the FILE*.
In particular, you might want to set the buffering mode to line-at-a-time, so the log file is flushed automatically after each line is written. You can also specify the size of the buffer to use.
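A sketch of that setup (the log file name is a placeholder; note that the Microsoft CRT treats _IOLBF as full buffering, so on Windows you may want an explicit fflush instead):

#include <cstdio>

int main() {
    std::FILE* log = std::fopen("app.log", "a");
    // line-buffered: flush automatically after each '\n'
    std::setvbuf(log, nullptr, _IOLBF, 4096);

    std::fprintf(log, "program started\n");
    // ... std::fflush(log) after critical messages for extra safety ...
    std::fclose(log);
}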