I'm writing 3D model data out to a file, which includes a lot of different types of information (meshes, textures, animation, etc.) and would be about 50 to 100 MB in size.
I want to put all this in a single file, but I'm afraid it will cost me if I need to read only a small portion of that file to get what I want.
Should I be using multiple smaller files for this, or is a single very large file okay? I don't know how the filesystem handles jumping around within giant files, so for all I know seeking through a large file may either be costly, or no problem at all.
Also, is there anything special I must do if using a single large file?
There is no issue with accessing data in the middle of a file - the operating system won't need to read the entire file, it can skip to any point easily. Where the complexity comes in is you'll need to provide an index that can be read to identify where the various pieces of data are.
For example, if you want to read a particular animation, you'll need a way to tell your program where this data is in the file. One way would be to store an index structure at the beginning of the file, which your program would read to find out where all of the pieces of data are. It could then look up the animation in this index, discover that it's at position 24680 and is 2048 bytes long, and it could then seek to this position to read the data.
You might want to look up the fseek call if you're not familiar with seeking within a file: http://www.cplusplus.com/reference/cstdio/fseek/
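To make that concrete, here's a minimal sketch in C of the index idea (the header layout, names, and sizes are invented for illustration; a real format would want a magic number, versioning, and proper error handling):

#include <stdio.h>
#include <string.h>

/* Invented layout: a count, then `count` fixed-size index entries,
   followed by the raw chunk data elsewhere in the file. */
struct IndexEntry {
    char name[16];   /* e.g. "walk_anim" */
    long offset;     /* byte position of the chunk in the file */
    long size;       /* chunk length in bytes */
};

/* Look up `name` in the index and read its chunk into buf.
   Returns bytes read, or -1 on failure. */
long read_chunk(FILE *f, const char *name, void *buf, long bufsize)
{
    long count, i;

    fseek(f, 0, SEEK_SET);
    if (fread(&count, sizeof count, 1, f) != 1)
        return -1;

    for (i = 0; i < count; i++) {
        struct IndexEntry e;
        if (fread(&e, sizeof e, 1, f) != 1)
            return -1;
        if (strcmp(e.name, name) == 0 && e.size <= bufsize) {
            fseek(f, e.offset, SEEK_SET);  /* jump straight to the data */
            return (long)fread(buf, 1, (size_t)e.size, f);
        }
    }
    return -1;  /* not in the index */
}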
I want to split a big file into smaller ones without copying part of the file, and without using file streams or functions that use them (if possible).
Imagine we have a big file which consists of 3 files:
[[File1bytes][File2bytes][File3bytes]]
In my opinion we can do this with these steps:
Use the SetEndOfFile function to truncate the bytes of the last file ([File3bytes] in our example) - see the sketch after these steps
Somehow force our file system to recognize those truncated bytes ([File3bytes]) as a real file (maybe by adding some info to the MFT, or doing something with NTFS if that is possible, or using some function or method which can do all of the above for us).
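For step 1, a hedged Win32 sketch of the truncation (error handling omitted; note that SetEndOfFile simply frees the truncated clusters rather than preserving them, which is exactly why step 2 has no supported API):

#include <windows.h>

/* Truncate the container file so it ends where [File3bytes] begins. */
void truncate_at(const wchar_t *path, LONGLONG cutOffset)
{
    HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    LARGE_INTEGER pos;
    pos.QuadPart = cutOffset;            /* first byte of [File3bytes] */
    SetFilePointerEx(h, pos, NULL, FILE_BEGIN);
    SetEndOfFile(h);                     /* everything past here is freed */
    CloseHandle(h);
}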
Any suggestions?
How about creating a nested file system over the existing file system where the very large file actually resides, and defining some IOCTL commands for splitting? Check this link:
How can I write my own 'filesystem' within Windows?
I maintain an application that collects data from a datalogger and appends that data to the end of a binary file. The nature of this system is that the file can grow large (> 4 gigabytes) in small steps at a time. One of the users of my application has seen cases on his NTFS partition where attempts to append data fail. The error is being reported as a result of a call to fflush(). When this happens, the return value of GetLastError() is 665 (ERROR_FILE_SYSTEM_LIMITATION). MSDN gives the following description for this error:
The requested operation could not be completed due to a file system limitation
A search for this error code on Google gives results related to SQL Server with VERY large files (tens of gigabytes) but, at present, our file is much smaller. This user has not been able to get the file to grow beyond 10 gigabytes. We can temporarily correct the situation by doing some operation (like copying the file) that forces some sort of rewrite in the file system. Unfortunately, I am not sure what is going on to put us in this condition in the first place. What specific conditions in an NTFS file system can lead to this particular error being reported on a call to fflush()?
This sounds like you've reached a limit in the fragmentation of the file. In other words, each flush is creating a new extent (fragment) of the file and the filesystem is having a hard time finding a place to keep track of the list of fragments. That would explain why copying the file helps -- it creates a new file with fewer fragments.
Another thing that would probably work is defragmenting the file (using Sysinternals' contig utility you may be able to do so while it's in use). You can also use contig to tell you how many fragments the file has; I'm guessing it's on the order of one million.
If you have to flush the file frequently and can't defrag it, something you can do is simply create the file fairly large in the first place (to allocate the space all at once) and then write to successive bytes of the file rather than append.
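For example, a hedged Win32 sketch of that pre-allocation (names invented; error handling omitted):

#include <windows.h>

/* Extend the file to its expected final size once, up front, so that
   later writes land in already-allocated space instead of growing the
   file (and adding extents) one flush at a time. */
void preallocate(HANDLE h, LONGLONG bytes)
{
    LARGE_INTEGER size;
    size.QuadPart = bytes;               /* e.g. 10 GB reserved up front */
    SetFilePointerEx(h, size, NULL, FILE_BEGIN);
    SetEndOfFile(h);
    /* The application must now track its own logical end-of-data,
       since the physical EOF no longer marks it. */
}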
If you're brave (and your process has admin access), you can defragment the file yourself with a few API calls: http://msdn.microsoft.com/en-us/library/aa363911(v=VS.85).aspx
In Windows, is it possible through an API to write to the middle of a file without overwriting any data and without having to rewrite everything after that?
If it's possible then I believe it will obviously fragment the file; how many times can I do it before it becomes a serious problem?
If it's not possible what approach/workaround is usually taken? Re-writing everything after the insertion point becomes prohibitive really quickly with big (ie, gigabytes) files.
Note: I can't avoid having to write to the middle. Think of the application as a text editor for huge files where the user types stuff and then saves. I also can't split the files in several smaller ones.
I'm unaware of any way to do this if the interim result you need is a flat file that can be used by applications other than the editor. If you want a flat file to be produced, you will have to update it from the change point to the end of the file, since it's really just a sequential file.
But the italics are there for good reason. If you can control the file format, you have some options. Some versions of MS Word had a quick-save feature where they didn't rewrite the entire document, rather they appended a delta record to the end of the file. Then, when re-reading the file, it applied all the deltas in order so that what you ended up with was the right file. This obviously won't work if the saved file has to be usable immediately to another application that doesn't understand the file format.
What I'm proposing here is to not store the file as text. Use an intermediate form that you can efficiently edit and save, then have a step which converts that to a usable text file infrequently (e.g., on editor exit). That way, the user can save as often as they want but the time-expensive operation won't have as much of an impact.
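A hedged sketch of that delta idea in C (the record layout is invented for the example):

#include <stdio.h>

struct DeltaHeader {
    long offset;   /* where in the logical file the change applies */
    long length;   /* number of changed bytes that follow          */
};

/* Instead of rewriting the file, append a record saying
   "at offset X, the bytes are now ...". */
void append_delta(FILE *f, long offset, const void *data, long length)
{
    struct DeltaHeader h;
    h.offset = offset;
    h.length = length;
    fseek(f, 0, SEEK_END);            /* deltas always go at the end */
    fwrite(&h, sizeof h, 1, f);
    fwrite(data, 1, (size_t)length, f);
}
/* On load, replay every delta in file order over the base image
   to reconstruct the current document. */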
Beyond that, there are some other possibilities.
Memory-mapping (rather than loading) the file may provide efficiencies which would speed things up. You'd probably still have to rewrite to the end of the file, but it would be happening at a lower level in the OS (see the sketch below).
If the primary reason you want fast save is to start letting the user keep working (rather than having the file available to another application), you could farm the save operation out to a separate thread and return control to the user immediately. Then you would need synchronisation between the two threads to prevent the user modifying data yet to be saved to disk.
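To illustrate the memory-mapping suggestion above, a hedged Win32 sketch (it maps the whole file, assumes everything fits in the address space, and omits all error handling):

#include <windows.h>
#include <string.h>

/* Insert `insertLen` bytes at `insertAt` by shifting the tail with
   memmove(); the OS writes the dirtied pages back lazily, but this
   is still O(tail) work - just done through the page cache. */
void insert_bytes(const wchar_t *path, size_t insertAt,
                  const char *newText, size_t insertLen, size_t fileLen)
{
    HANDLE f = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    ULARGE_INTEGER sz;
    sz.QuadPart = fileLen + insertLen;   /* also extends the file */
    HANDLE m = CreateFileMappingW(f, NULL, PAGE_READWRITE,
                                  sz.HighPart, sz.LowPart, NULL);
    char *bytes = (char *)MapViewOfFile(m, FILE_MAP_WRITE, 0, 0, 0);

    memmove(bytes + insertAt + insertLen, bytes + insertAt,
            fileLen - insertAt);
    memcpy(bytes + insertAt, newText, insertLen);

    UnmapViewOfFile(bytes);
    CloseHandle(m);
    CloseHandle(f);
}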
The realistic answer is no. Your only real choices are to rewrite from the point of the modification, or build a more complex format that uses something like an index to tell how to arrange records into their intended order.
From a purely theoretical viewpoint, you could sort of do it under just the right circumstances. Using FAT (for example, but most other file systems have at least some degree of similarity) you could go in and directly manipulate the FAT. The FAT is basically a linked list of clusters that make up a file. You could modify that linked list to add a new cluster in the middle of a file, and then write your new data to that cluster you added.
Please note that I said purely theoretical. Doing this kind of manipulation under a completely unprotected system like MS-DOS would have been difficult, but bordering on reasonable. With most newer systems, doing the modification at all would generally be pretty difficult. Most modern file systems are also (considerably) more complex than FAT, which would add further difficulty to the implementation. In theory it's still possible -- in practice, it's now thoroughly insane to even contemplate, where it was once almost reasonable.
I'm not sure about the format of your file, but you could make it 'record' based.
Write your data in chunks and give each chunk an id; the id could be the chunk's offset in the file. At the start of the file you could have a header with a list of ids, so that you can read the records in order. At the end of the 'list of ids' you could point to another location in the file (an id/offset) that stores another list of ids - something similar to a filesystem.
To add new data, you append it at the end and update the index (add its id to the list). You still have to figure out how to handle deleting and updating records. If records are all the same size, then to delete one you can just mark it as empty and reuse the slot next time, with appropriate updates to the index table.
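A hedged sketch of what those structures might look like (the layout and sizes are invented for illustration):

#include <stdio.h>

/* Each chunk of data is written as a fixed-size record whose id is
   simply its offset in the file. */
struct Record {
    long id;             /* == this record's offset in the file */
    int  in_use;         /* 0 = deleted, slot may be reused */
    char payload[512];   /* fixed size keeps deletion/reuse simple */
};

/* An index block: a list of record ids, plus a pointer to the next
   index block -- the filesystem-like chaining mentioned above. */
struct IndexBlock {
    long ids[64];        /* offsets of records, in logical order */
    int  count;          /* how many entries of `ids` are valid */
    long next_index;     /* offset of the next IndexBlock, or 0 */
};

/* Appending = write the Record at EOF, then add its offset to the
   last IndexBlock (allocating a new one via next_index when full). */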
Probably the most efficient way to do this (if you really want to do it) is to call ReadFileScatter() to read the chunks before and after the insertion point, insert the new data in the middle of the FILE_SEGMENT_ELEMENT[3] list, and call WriteFileGather(). Yes, this involves moving bytes on disk. But you leave the hard parts to the OS.
If you're using .NET 4, try a memory-mapped file if you have an editor-like application - it might just be the ticket. Something like this (I didn't type it into VS, so the syntax may not be exact):
using System.IO;
using System.IO.MemoryMappedFiles;

// Capacity must cover the offset written below (1 MB would not).
using (MemoryMappedFile bigFile = MemoryMappedFile.CreateFromFile(
    @"C:\bigfile.dat",
    FileMode.OpenOrCreate,
    "BigFileMemMapped",
    2L * 1024 * 1024 * 1024,               // 2 GB capacity
    MemoryMappedFileAccess.ReadWrite))
using (MemoryMappedViewAccessor view = bigFile.CreateViewAccessor())
{
    long offset = 1000000000;
    int myValue = 42;                      // must be a value type (struct)
    view.Write(offset, ref myValue);
}
I noted both paxdiablo's answer on dealing with other applications, and Matteo Italia's comment on Installable File Systems. That made me realize there's another non-trivial solution.
Using reparse points, you can create a "virtual" file from a base file plus deltas. Any application unaware of this method will see a contiguous range of bytes, as the deltas are applied on the fly by a file system filter. For small deltas (total < 16 KB), the delta information can be stored in the reparse point itself; larger deltas can be placed in an alternate data stream. Non-trivial, of course.
I know that this question is marked "Windows", but I'll still add my $0.05 and say that on Linux it is possible to insert or remove a lump of data to/from the middle of a file without either leaving a hole or copying the second half forward/backward:
fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len)
fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len)
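A minimal usage sketch (Linux-only; both flags require filesystem support, e.g. ext4 or XFS, and offset/len generally must be multiples of the filesystem block size):

#define _GNU_SOURCE          /* for fallocate() */
#include <fcntl.h>
#include <unistd.h>
#include <linux/falloc.h>    /* FALLOC_FL_* flags */

int main(void)
{
    int fd = open("big.dat", O_RDWR);
    off_t blk = 4096;                 /* assume a 4 KB block size */

    /* Make room for one block at offset 8192; the new data can then
       simply be pwrite()n into the gap: */
    fallocate(fd, FALLOC_FL_INSERT_RANGE, 2 * blk, blk);

    /* Or remove one block at offset 8192, closing the gap: */
    fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 2 * blk, blk);

    close(fd);
    return 0;
}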
Again, I know that this probably won't help the OP, but I personally landed here searching for a Linux-specific answer. (There is no "Windows" anywhere in the question text, so the search engine saw no problem with sending me here.)
Is there an O(1) way in the Windows API to concatenate 2 files?
O(1) in the sense of not having to read in the entire second file and write it out to the file being appended to - as opposed to processing O(n) bytes.
I think this should be possible at the file system driver level, and I don't think there is a user-mode API available for this, but I thought I'd ask.
If the "new file" is only going to be read by your application, then you can get away without actually concatenating them on disk.
You can just implement a stream interface that behaves as if the two files have been concatenated, and then use that stream instead of whatever the default filestream implementation used by your app framework is.
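For example, a hedged C sketch of such a wrapper (read-only and sequential; names invented):

#include <stdio.h>

/* Present two files as one logical read stream: drain the first,
   then continue seamlessly into the second. */
typedef struct {
    FILE *first, *second;
} ConcatStream;

size_t concat_read(ConcatStream *s, void *buf, size_t want)
{
    size_t got = 0;
    if (s->first) {
        got = fread(buf, 1, want, s->first);
        if (got < want && feof(s->first)) {   /* first file exhausted */
            fclose(s->first);
            s->first = NULL;
        }
    }
    if (got < want && s->second)
        got += fread((char *)buf + got, 1, want - got, s->second);
    return got;
}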
If that won't work for you, and you are using Windows, you could always create a reparse point and a file system filter. Note that even a "minifilter" runs in kernel mode, although minifilters are considerably simpler to write than legacy filter drivers.
You can probably find more information about it here:
http://www.microsoft.com/whdc/driver/filterdrv/default.mspx
No, there isn't.
The best you could hope for is O(n), where n is the length of the shorter of the two files.
From a theoretical perspective, this is possible (on-disk) provided that:
the second file is destroyed
the concatenation honours the filesystem's fragment alignment (e.g. occurs on a cluster boundary)