Writing to the middle of the file (without overwriting data) - c++

In Windows, is it possible through an API to write to the middle of a file without overwriting any data and without having to rewrite everything after that point?
If it's possible then I believe it will obviously fragment the file; how many times can I do it before it becomes a serious problem?
If it's not possible, what approach/workaround is usually taken? Rewriting everything after the insertion point becomes prohibitive really quickly with big (i.e., multi-gigabyte) files.
Note: I can't avoid having to write to the middle. Think of the application as a text editor for huge files where the user types stuff and then saves. I also can't split the files in several smaller ones.

I'm unaware of any way to do this if the interim result you need is a flat file that can be used by applications other than the editor. If you want a flat file to be produced, you will have to update it from the change point to the end of the file, since it's really just a sequential file.
But that "if" is there for good reason. If you can control the file format, you have some options. Some versions of MS Word had a quick-save feature where they didn't rewrite the entire document; instead they appended a delta record to the end of the file. Then, when re-reading the file, Word applied all the deltas in order so that what you ended up with was the right document. This obviously won't work if the saved file has to be immediately usable by another application that doesn't understand the file format.
What I'm proposing there is to not store the file as text. Use an intermediate form that you can efficiently edit and save, then have a step which converts that to a usable text file infrequently (e.g., on editor exit). That way, the user can save as much as they want but the time-expensive operation won't have as much of an impact.
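A minimal in-memory sketch of the delta idea, assuming a hypothetical format in which every quick save appends an "insert text at offset" record and the flat text is only produced on demand. The names `Delta`, `DeltaDocument`, `quick_save_insert` and `flatten` are made up for illustration; a real format would also need delete records and an on-disk encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One appended delta: "insert `text` at byte `offset` of the document as it
// stood after all earlier deltas were applied".
struct Delta {
    std::size_t offset;
    std::string text;
};

// Hypothetical container: the original text plus a log of deltas.
// A quick save only appends to `log`; `base` is never rewritten.
struct DeltaDocument {
    std::string base;
    std::vector<Delta> log;

    void quick_save_insert(std::size_t offset, std::string text) {
        log.push_back(Delta{offset, std::move(text)});
    }

    // Replays the deltas in order to produce the flat text. This is the
    // expensive step that would run infrequently, e.g. on editor exit.
    std::string flatten() const {
        std::string out = base;
        for (const Delta& d : log) {
            assert(d.offset <= out.size());
            out.insert(d.offset, d.text);
        }
        return out;
    }
};
```

Each save costs only the size of the new delta, while the cost of `flatten` grows with the length of the log, which is why the conversion to a plain text file is best done rarely.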
Beyond that, there are some other possibilities.
Memory-mapping (rather than loading) the file may provide efficiencies which would speed things up. You'd probably still have to rewrite to the end of the file, but it would be happening at a lower level in the OS.
If the primary reason you want fast save is to start letting the user keep working (rather than having the file available to another application), you could farm the save operation out to a separate thread and return control to the user immediately. Then you would need synchronisation between the two threads to prevent the user modifying data yet to be saved to disk.

The realistic answer is no. Your only real choices are to rewrite from the point of the modification, or build a more complex format that uses something like an index to tell how to arrange records into their intended order.
From a purely theoretical viewpoint, you could sort of do it under just the right circumstances. Using FAT (for example, but most other file systems have at least some degree of similarity) you could go in and directly manipulate the FAT. The FAT is basically a linked list of clusters that make up a file. You could modify that linked list to add a new cluster in the middle of a file, and then write your new data to that cluster you added.
Please note that I said purely theoretical. Doing this kind of manipulation under a completely unprotected system like MS-DOS would have been difficult but bordering on reasonable. With most newer systems, doing the modification at all would generally be pretty difficult. Most modern file systems are also (considerably) more complex than FAT, which would add further difficulty to the implementation. In theory it's still possible; in practice, it's now thoroughly insane to even contemplate, where it was once almost reasonable.
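To make the linked-list picture concrete, here is a toy model of a cluster chain. It is an illustration only and assumes a made-up `ToyFat` type; it does not touch a real file system, and real FAT entries also encode free and bad clusters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kEndOfChain = -1;

// Toy model of a FAT: next[c] is the cluster that follows cluster c in its
// file, or kEndOfChain for the last cluster of a file.
struct ToyFat {
    std::vector<int> next;

    // Links `new_cluster` into the chain right after `after`, so the file
    // grows in the middle without moving any existing cluster.
    void insert_after(int after, int new_cluster) {
        assert(after >= 0 && static_cast<std::size_t>(after) < next.size());
        assert(new_cluster >= 0 && static_cast<std::size_t>(new_cluster) < next.size());
        next[new_cluster] = next[after];
        next[after] = new_cluster;
    }

    // Returns the clusters of the file that starts at `first`, in file order.
    std::vector<int> chain(int first) const {
        std::vector<int> out;
        for (int c = first; c != kEndOfChain; c = next[c]) out.push_back(c);
        return out;
    }
};
```

The sketch also shows the catch: the inserted data has to fill whole clusters, so an insertion of arbitrary length at an arbitrary byte offset still does not fit this scheme.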

I'm not sure about the format of your file but you could make it 'record' based.
Write your data in chunks and give each chunk an id.
Id could be data offset in file.
At the start of the file you could have a header with a list of ids so that you can read records in order.
At the end of the 'list of ids' you could point to another location in the file (an id/offset) that stores another list of ids.
Something similar to filesystem.
To add new data you append them at the end and update index (add id to the list).
You have to figure out how to handle delete record and update.
If records are of the same size then to delete you can just mark it empty and next time reuse it with appropriate updates to index table.
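A rough sketch of the record idea, assuming the chunk id is simply the chunk's offset and length in an append-only data area. `ChunkRef`, `ChunkedFile` and its members are hypothetical names, and both the data area and the id list are kept in memory here rather than in a real file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Id of a chunk: where its bytes live in the data area.
struct ChunkRef {
    std::size_t offset;
    std::size_t length;
};

struct ChunkedFile {
    std::string data;             // stands in for the append-only data area
    std::vector<ChunkRef> order;  // stands in for the header's list of ids

    // Logical insert before position `index` in reading order: the bytes are
    // appended at the end, and only the id list changes in the middle.
    void insert_at(std::size_t index, const std::string& bytes) {
        assert(index <= order.size());
        ChunkRef ref{data.size(), bytes.size()};
        data += bytes;
        order.insert(order.begin() + static_cast<std::ptrdiff_t>(index), ref);
    }

    // Reads the chunks back in their intended order.
    std::string read_all() const {
        std::string out;
        for (const ChunkRef& r : order) out.append(data, r.offset, r.length);
        return out;
    }
};
```

Inserting "BBB" between two existing chunks leaves the earlier bytes where they are; only the small id list has to be rewritten.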

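Probably the most efficient way to do this (if you really want to do it) is to call ReadFileScatter() to read the chunks before and after the insertion point, insert the new data in the middle of the FILE_SEGMENT_ELEMENT[3] list, and call WriteFileGather(). Yes, this involves moving bytes on disk. But you leave the hard parts to the OS. Note that both calls require the file to be opened with FILE_FLAG_OVERLAPPED and FILE_FLAG_NO_BUFFERING, and that each FILE_SEGMENT_ELEMENT points to one page-sized buffer.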
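For comparison, this is roughly what "moving bytes on disk" looks like when done by hand. It is a portable sketch using standard C++ streams instead of the scatter/gather calls, and `insert_bytes` is a hypothetical helper: it shifts the tail of the file towards the end, block by block from the back, and then writes the new data into the gap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Inserts `payload` at byte `pos` of a seekable stream that currently holds
// `size` bytes. Every byte after `pos` is rewritten, so the cost is O(size - pos).
void insert_bytes(std::iostream& io, std::size_t size, std::size_t pos,
                  const std::string& payload, std::size_t block = 64 * 1024) {
    assert(pos <= size && block > 0);
    const std::size_t shift = payload.size();

    // Grow the stream first so that every later write lands inside it.
    io.seekp(0, std::ios::end);
    io.write(std::string(shift, '\0').data(), static_cast<std::streamsize>(shift));

    // Move the tail back to front so no unread byte is overwritten.
    std::vector<char> buf(block);
    std::size_t remaining = size - pos;
    while (remaining > 0) {
        const std::size_t n = std::min(block, remaining);
        const std::size_t from = pos + remaining - n;
        io.seekg(static_cast<std::streamoff>(from));
        io.read(buf.data(), static_cast<std::streamsize>(n));
        io.seekp(static_cast<std::streamoff>(from + shift));
        io.write(buf.data(), static_cast<std::streamsize>(n));
        remaining -= n;
    }

    // Finally drop the new data into the gap that was opened up.
    io.seekp(static_cast<std::streamoff>(pos));
    io.write(payload.data(), static_cast<std::streamsize>(shift));
}
```

With a `std::fstream` opened in binary read/write mode the same function works on a real file; with a `std::stringstream` it can be tried out without touching the disk.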

If you're using .NET 4 and have an editor-like application, try a memory-mapped file; it might just be the ticket. Something like this (I didn't type it into VS, so I'm not sure I got the syntax right):
MemoryMappedFile bigFile = MemoryMappedFile.CreateFromFile(
    @"C:\bigfile.dat",
    FileMode.Create,
    "BigFileMemMapped",
    1024 * 1024,  // capacity of the mapping in bytes
    MemoryMappedFileAccess.ReadWrite);
MemoryMappedViewAccessor view = bigFile.CreateViewAccessor();
long offset = 1000;  // must lie within the mapped capacity
view.Write<ObjectType>(offset, ref MyObject);  // ObjectType must be a struct

I noted both paxdiablo's answer on dealing with other applications, and Matteo Italia's comment on Installable File Systems. That made me realize there's another non-trivial solution.
Using reparse points, you can create a "virtual" file from a base file plus deltas. Any application unaware of this method will see a continuous range of bytes, as the deltas are applied on the fly by a file system filter. For small deltas (total <16 KB), the delta information can be stored in the reparse point itself; larger deltas can be placed in an alternate data stream. Non-trivial of course.

I know that this question is marked "Windows", but I'll still add my $0.05 and say that on Linux it is possible to either insert or remove a lump of data to/from the middle of a file without either leaving a hole or copying the second half forward/backward:
fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len)
fallocate(fd, FALLOC_FL_INSERT_RANGE, offset, len)
Again, I know that this probably won't help the OP, but I personally landed here searching for a Linux-specific answer. (There is no "Windows" word in the question, so the web search engine saw no problem with sending me here.)

Related

Parsing log file with multi-line entries

I'm working on parsing a reasonable sized log file (up to 50Mb, at which point it wraps) from a third-party application in order to detect KEY_STRINGs which happened within a specified time frame. A typical entry in this log file may look like this
DEBUG 2013-10-11@14:23:49 [PID] - Product.Version.Module
(Param 1=blahblah Param2=blahblah Param3 =blahblah
Method=funtionname)
String that we usually don't care about but may be KEY_STRING
Entries are separated by a blank line (\r\n at the end of the entry then \r\n before the next entry starts)
This is for a Windows specific implementation so doesn't need to be portable, and can be C/C++/Win32
Reading this line by line would be time consuming, but has the benefit of being able to parse the timestamp and check whether the entry is within the given time frame before checking whether any of the KEY_STRINGs are present in the entry. If I read the file by chunks I may find a KEY_STRING but the chunk doesn't have the earlier timestamp, or the chunk border may even be in the middle of the KEY_STRING. Reading the whole file into memory and parsing it isn't an option, as the application this is to be a part of currently has a relatively small footprint, so I can't justify increasing this by ~10x just for parsing a file (even temporarily). Is there a way I can read the file by delimited chunks (specifically "\r\n\r\n")? Or is there another/better method I've not thought of?
Any help on this will be greatly appreciated!
One possible solution is to use memory-mapped files. I've personally never used them for anything but toy applications, but know some of the theory behind it.
Essentially they provide a way of accessing the contents of files as if they're memory, I believe acting in a similar way to virtual memory, so required parts will be paged-in as required, and paged-out at some point (you should read the documentation to work out the rules behind this).
In pseudocode (because we all like pseudocode), you would do something along these lines:
HANDLE file = CreateFile(...);
HANDLE file_map = CreateFileMapping(file, 0, PAGE_READONLY, 0, 0, ...);
LPVOID mem = MapViewOfFile(file_map, FILE_MAP_READ, 0, 0, 0);
// at this point you can use mem to access data in the mapped part of the file...
// for your code, you would perform parsing as if you'd read the file into RAM.
// when you're done, unmap and close the file:
UnmapViewOfFile(mem);
CloseHandle(file_map);
CloseHandle(file);
I apologise now for not giving advice most excellent, but instead encourage further reading - Windows provides a lot of functionality for handling memory, and it's mostly worth a read.
Make sure you really can't use the memory; perhaps you're being a little bit too "paranoid"? Premature optimization, and all that.
Read it line by line (since that makes it easier to separate entries) but wrap the line-reading with a buffered read, reading as much at a time as you're comfortable with, perhaps 1 MB. That minimizes disk I/O, which is often good for performance.
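A small sketch of that line-by-line approach, assuming entries are separated by a blank line as described in the question. `next_entry` is a hypothetical helper that collects one whole entry at a time, so a chunk border can never fall inside an entry.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Reads one log entry: consecutive non-empty lines up to a blank line or EOF.
// Lines are joined with '\n'. Returns false when no further entry exists.
bool next_entry(std::istream& in, std::string& entry) {
    entry.clear();
    std::string line;
    while (std::getline(in, line)) {
        // Files read in binary mode keep the '\r' of each "\r\n" pair.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) {
            if (!entry.empty()) return true;  // blank line closes the entry
            continue;                         // skip leading blank lines
        }
        if (!entry.empty()) entry += '\n';
        entry += line;
    }
    return !entry.empty();
}
```

A `std::ifstream` already reads through an internal buffer; a larger one can be requested with `rdbuf()->pubsetbuf(...)` before opening the file, although the effect of that call is implementation-defined.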
Assuming that (as would normally be the case) all the entries in the file are in order by time, you should be able to use a variant of a binary search to find the correct start and end points, then parse the data in between.
The basic idea would be to seek to the middle of the file, then read a few lines until you get to one starting with "DEBUG", then read the time-stamp. If it's earlier than the time you care about, seek forward to the 3/4 mark. If it's later than the time you care about, seek back to the 1/4 mark. Repeat the basic idea until you've found the beginning. Then do the same thing for the end time.
Once the amount by which you're seeking drops below a certain threshold (e.g., 64K) it's probably faster to seek to the beginning of the 64K-aligned block, and just keep reading forward from there than to do any more seeking.
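A sketch of that search, under a few assumptions that are mine rather than the question's: every entry header starts with "DEBUG " followed by a fixed-width time-stamp (so stamps can be compared as plain strings), entries are sorted by time, and the stream is opened in binary mode so byte offsets are reliable. `Header`, `probe` and `first_at_or_after` are hypothetical names, and the 64K cut-over to a linear scan is left out for brevity.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

struct Header {
    std::streamoff offset;  // where the header line starts
    std::string stamp;      // text between "DEBUG " and the next space
};

// Finds the first header line that starts at or after byte `pos`.
bool probe(std::istream& in, std::streamoff pos, Header& out) {
    in.clear();
    in.seekg(pos > 0 ? pos - 1 : 0);
    std::string line;
    if (pos > 0) std::getline(in, line);  // finish the line holding byte pos-1
    for (;;) {
        std::streamoff at = in.tellg();
        if (!std::getline(in, line)) return false;
        if (line.compare(0, 6, "DEBUG ") == 0) {
            out.offset = at;
            out.stamp = line.substr(6, line.find(' ', 6) - 6);
            return true;
        }
    }
}

// Returns the offset of the first entry whose stamp is >= target, or -1.
std::streamoff first_at_or_after(std::istream& in, std::streamoff size,
                                 const std::string& target) {
    std::streamoff lo = 0, hi = size;
    Header h;
    while (lo < hi) {
        std::streamoff mid = lo + (hi - lo) / 2;
        if (!probe(in, mid, h) || h.stamp >= target) {
            hi = mid;           // answer starts at or before what mid sees
        } else {
            lo = h.offset + 1;  // this header is too early, look after it
        }
    }
    return probe(in, lo, h) ? h.offset : -1;
}
```

Calling it once with the start time and once with the end time gives the byte range to parse; each call probes only a logarithmic number of positions instead of reading the whole file.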
Another possibility to consider would be whether you can do some work in the background to build an index of the file as it's being modified, then use the index when you actually need a result. The indexer would (for example) read the time-stamp of each entry right after it's written (e.g., using ReadDirectoryChangesW to be told when the log file is modified). It would translate the textual time-stamp into, for example, a time_t, then store an entry in the index giving the time_t and the file offset for that entry. This should be small enough (probably under a megabyte for a 50-megabyte log file) that it would be easy to work with it entirely in memory.

Structure for storing data from thousands of files on a mobile device

I have more than 32000 binary files that store a certain kind of spatial data. I access the data by file name. The files range in size from 0-400kb. I need to be able to access the content of these files randomly and at various time points. I don't like the idea of having 32000+ separate files of data installed on a mobile device (even though the total file size is < 100mb). I want to merge the files into a single structure that will still let me access the data I need just as quickly. I'd like suggestions as to what the best way to do this is. Any suggestions should have C/C++ libs for accessing the data and should have a liberal license that allows inclusion in commercial, closed-source applications without any issue.
The only thing I've thought of so far is storing everything in an sqlite database, though I'm not sure if this is the best method, or what considerations I need to take into account for storing blob data with quick look up times (ie, what schema I'd use).
Why not roll your own?
Your requirements sound pretty simple and straightforward. Just bundle everything into a single binary file and add an index at the beginning telling which file starts where and how big it is.
30 lines of C++ code max. Invest a good 10 minutes designing a good interface for it so you could replace the implementation when and if the need occurs.
That is of course if the data is read only. If you need to change it as you go, it gets hairy fast.
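Something along these lines, as a sketch rather than a finished format. The layout (a count, then name, offset and size per file, then all the file bytes) and the names `write_bundle`, `read_index` and `read_file` are assumptions made for this example; integers are written in host byte order, which a portable format would need to pin down.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

struct Span { std::uint64_t offset; std::uint64_t size; };

struct Bundle {
    std::map<std::string, Span> index;  // file name -> location in data area
    std::streamoff data_start = 0;      // where the data area begins
};

static void put_u64(std::ostream& out, std::uint64_t v) {
    out.write(reinterpret_cast<const char*>(&v), sizeof v);  // host byte order
}

static std::uint64_t get_u64(std::istream& in) {
    std::uint64_t v = 0;
    in.read(reinterpret_cast<char*>(&v), sizeof v);
    return v;
}

// Writes the index first, then the contents of every file back to back.
void write_bundle(std::ostream& out, const std::map<std::string, std::string>& files) {
    put_u64(out, files.size());
    std::uint64_t offset = 0;
    for (const auto& kv : files) {
        put_u64(out, kv.first.size());
        out.write(kv.first.data(), static_cast<std::streamsize>(kv.first.size()));
        put_u64(out, offset);
        put_u64(out, kv.second.size());
        offset += kv.second.size();
    }
    for (const auto& kv : files)
        out.write(kv.second.data(), static_cast<std::streamsize>(kv.second.size()));
}

// Loads only the index; the file contents stay on disk.
Bundle read_index(std::istream& in) {
    Bundle b;
    const std::uint64_t count = get_u64(in);
    for (std::uint64_t i = 0; i < count; ++i) {
        std::string name(static_cast<std::size_t>(get_u64(in)), '\0');
        in.read(&name[0], static_cast<std::streamsize>(name.size()));
        Span s;
        s.offset = get_u64(in);
        s.size = get_u64(in);
        b.index[name] = s;
    }
    b.data_start = in.tellg();
    return b;
}

// Random access by name: one seek and one read.
std::string read_file(std::istream& in, const Bundle& b, const std::string& name) {
    const Span& s = b.index.at(name);
    std::string bytes(static_cast<std::size_t>(s.size), '\0');
    in.seekg(b.data_start + static_cast<std::streamoff>(s.offset));
    in.read(&bytes[0], static_cast<std::streamsize>(bytes.size()));
    return bytes;
}
```

Keeping these three functions behind a small interface is what makes it cheap to swap the implementation for something like sqlite later.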

Reading/writing only needed data to/from a large data file to minimize memory footprint

I'm currently brainstorming a financial program that will deal with (over time) fairly large amounts of data. It will be a C++/Qt GUI app.
I figure reading all the data into memory at runtime is out of the question because given enough data, it might hog too much memory.
I'm trying to come up with a way to read into memory only what I need, for example, if I have an account displayed, only the data that is actually being displayed (and anything else that is absolutely necessary). That way the memory footprint could remain small even if the data file is 4gb or so.
I thought about some sort of searching function that would slowly read the file line by line and find a 'tag' or something identifying the specific data I want, and then load that, but considering this could theoretically happen every time there's a gui update that seems like a terrible way to go.
Essentially I want to be able to efficiently locate specific data in a file, read only that into memory, and possibly change it and write it back without reading and writing the whole file every time. I'm not an experienced programmer and my googling for ideas hasn't been very successful.
Edit: I should probably mention I intend to use Qt's fancy QDataStream related classes to store the data. In other words the file will likely be binary and not easily searchable line by line like a text file.
Okay based on your comments.
Start simple. Forget about your fiscal application for now, except as background. A suitable first exercise for your file system would be:
One data type, e.g. accounts.
Start with fixed width columns giving you a fixed width record.
One file for data
Have another file for the index of account number
Do Insert, Update and Delete, you'll learn a lot.
For instance.
To delete, you could find the index entry and the data, move them out, and rebuild both files.
Or you could have an internal field on the account record that indicates it has been deleted, set that in the data file, and just remove the index entry. Removing the index entry still means rewriting the entire index file, though. You could put the delete flag in the index file instead...
When inserting do you want to append, do you want to find a deleted record and reuse that slot?
Is your index just going to be a straight list of accounts and positions, or do you want to hash it, or use a tree? You could spend weeks if not months looking at indexing strategies alone.
Happy learning anyway. It will be interesting to help with your future questions.
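As a starting point for that exercise, here is a sketch of a fixed-width record store. It assumes a made-up `AccountRecord` layout and keeps the index (account number to slot) in memory instead of in a second file; deleting only drops the index entry and remembers the slot for reuse. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <map>
#include <sstream>
#include <vector>

// Fixed-width record: slot n lives at byte n * sizeof(AccountRecord).
struct AccountRecord {
    std::uint32_t account_no;
    char name[28];  // fixed width, NUL padded
    std::int64_t balance_cents;
};

AccountRecord make_account(std::uint32_t no, const char* name, std::int64_t cents) {
    AccountRecord rec{};  // zero-fills, so name stays NUL terminated
    rec.account_no = no;
    std::strncpy(rec.name, name, sizeof rec.name - 1);
    rec.balance_cents = cents;
    return rec;
}

class AccountStore {
public:
    explicit AccountStore(std::iostream& io) : io_(io) {}

    // Reuses a freed slot if there is one, otherwise appends a new slot.
    void insert(const AccountRecord& rec) {
        std::size_t slot;
        if (!free_slots_.empty()) {
            slot = free_slots_.back();
            free_slots_.pop_back();
        } else {
            slot = slot_count_++;
        }
        io_.clear();
        io_.seekp(static_cast<std::streamoff>(slot * sizeof rec));
        io_.write(reinterpret_cast<const char*>(&rec), sizeof rec);
        index_[rec.account_no] = slot;
    }

    // Looks the slot up in the index, then reads exactly one record.
    bool find(std::uint32_t no, AccountRecord& out) {
        auto it = index_.find(no);
        if (it == index_.end()) return false;
        io_.clear();
        io_.seekg(static_cast<std::streamoff>(it->second * sizeof out));
        io_.read(reinterpret_cast<char*>(&out), sizeof out);
        return static_cast<bool>(io_);
    }

    // Deleting touches only the index; the data file is not rewritten.
    void remove(std::uint32_t no) {
        auto it = index_.find(no);
        if (it == index_.end()) return;
        free_slots_.push_back(it->second);
        index_.erase(it);
    }

    std::size_t slot_of(std::uint32_t no) const { return index_.at(no); }

private:
    std::iostream& io_;
    std::map<std::uint32_t, std::size_t> index_;  // account number -> slot
    std::vector<std::size_t> free_slots_;
    std::size_t slot_count_ = 0;
};
```

Writing the struct's raw bytes ties the file to one compiler and platform, which is fine for learning but is one of the first things a real format would replace with explicit field encoding.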

How does large text file viewer work? How to build a large text reader

How does a large text file viewer work?
I'm assuming that:
Threading is used to handle the file
The TextBox is updated line by line
Effective memory handling is used
Are these assumptions correct? If someone were to develop their own, what are the musts and don'ts?
I'm looking to implement one using a DataGrid instead of a TextBox
I'm comfortable with C++ and Python. I'll probably use Qt/PyQt.
EDIT
The files I have are usually between 1.5 and 2 GB. I'm looking at editing and viewing these files.
I believe that the trick is not loading the entire file into memory, but using seek and such to just load the part which is viewed (possibly with a block before and after to handle a bit of scrolling). Perhaps even using memory-mapped buffers, though I have no experience with those.
Do realize that modifying a large file (fast) is different from just viewing it. You might need to copy the gigabytes of data surrounding the edit to a new file, which may be slow.
In Kernighan and Plauger's classic (antique?) book "Software Tools in Pascal" they cover the development and design choices of a version of ed(1) and note
"A warning: edit is a big program (excluding contributions from translit, find, and change; at 950 lines, it is fifty percent bigger than anything else in this book."
And they (literally) didn't even have string types to use. Since they note that the file to be edited may exist on tape which doesn't support arbitrary writes in the middle, they had to keep an index of line positions in memory and work with a scratch file to store changes, deletions and additions, merging the whole together upon a "save" command. They, like you, were concerned about memory constraining the size of their editable file.
The general structure of this approach is preserved in the GNU ed project, particularly in buffer.c
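The "index of line positions in memory" part can be sketched in a few lines. This is only an illustration of the idea, not how GNU ed implements it; `build_line_index` and `read_line` are hypothetical helpers, and the stream is assumed to be opened in binary mode so that offsets are exact.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Scans the stream once and records the byte offset at which each line starts.
std::vector<std::streamoff> build_line_index(std::istream& in) {
    std::vector<std::streamoff> starts;
    in.clear();
    in.seekg(0);
    std::string line;
    for (;;) {
        std::streamoff at = in.tellg();
        if (!std::getline(in, line)) break;
        starts.push_back(at);
    }
    return starts;
}

// Fetches line `n` with one seek, without keeping the file in memory.
std::string read_line(std::istream& in, const std::vector<std::streamoff>& starts,
                      std::size_t n) {
    in.clear();  // a previous read may have hit end of file
    in.seekg(starts.at(n));
    std::string line;
    std::getline(in, line);
    return line;
}
```

The index costs one offset per line, so a viewer can jump to any line and load just the lines that are currently on screen.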

Is there a O(1) way in windows api to concatenate 2 files?

Is there a O(1) way in windows API to concatenate 2 files?
O(1) with respect to not having to read in the entire second file and write it out to the file you want to append to. So as opposed to O(n) bytes processed.
I think this should be possible at the file system driver level, and I don't think there is a user mode API available for this, but I thought I'd ask.
If the "new file" is only going to be read by your application, then you can get away without actually concatenating them on disk.
You can just implement a stream interface that behaves as if the two files have been concatenated, and then use that stream as opposed to what ever the default filestream implementation used by your app framework is.
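A sketch of such a stream-like wrapper, assuming both inputs are seekable and their sizes are known up front. `ConcatReader` and `read_at` are hypothetical names; a real version would plug into whatever stream interface your framework uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Presents two seekable streams as one logical byte range: [first][second].
class ConcatReader {
public:
    ConcatReader(std::istream& first, std::size_t first_size,
                 std::istream& second, std::size_t second_size)
        : a_(first), b_(second), a_size_(first_size), b_size_(second_size) {}

    std::size_t size() const { return a_size_ + b_size_; }

    // Reads up to `n` bytes starting at logical offset `pos`.
    // Returns the number of bytes actually read.
    std::size_t read_at(std::size_t pos, char* out, std::size_t n) {
        std::size_t done = 0;
        if (pos < a_size_) {
            const std::size_t take = std::min(n, a_size_ - pos);
            done += read_from(a_, pos, out, take);
            pos += take;
        }
        if (done < n && pos >= a_size_ && pos < size()) {
            const std::size_t take = std::min(n - done, size() - pos);
            done += read_from(b_, pos - a_size_, out + done, take);
        }
        return done;
    }

private:
    static std::size_t read_from(std::istream& s, std::size_t pos, char* out,
                                 std::size_t n) {
        s.clear();
        s.seekg(static_cast<std::streamoff>(pos));
        s.read(out, static_cast<std::streamsize>(n));
        return static_cast<std::size_t>(s.gcount());
    }

    std::istream& a_;
    std::istream& b_;
    std::size_t a_size_;
    std::size_t b_size_;
};
```

Nothing is copied on disk: a read that straddles the boundary is simply served from the end of the first file and the start of the second.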
If that won't work for you, and you are using Windows, you could always create a reparse point and a file system filter. Note that a file system "minifilter" is a kernel-mode driver, although it can communicate with a user-mode service.
You can probably find more information about it here:
http://www.microsoft.com/whdc/driver/filterdrv/default.mspx
No, there isn't.
The best you could hope for is O(n), where n is the length of the shorter of the two files.
From a theoretical perspective, this is possible (on-disk) provided that:
the second file is destroyed
the concatenation honours the filesystem's fragment alignment (e.g. occurs on a cluster boundary)