I have the following case. I have a big file say 1 KB. I want to read the first 100 bytes and then delete the 100 bytes read data from the file and then read next 100 bytes. To read 100 bytes is ok, but how do I delete 100 bytes from the file?
This is commonly done as a multiple-step process:
Rename the original file.
Write the data you want into a new file with the original file name.
Delete the old file with the temporary name that contains the data you no longer want.
That way, if something were to go wrong, you could simply restore the original file that you renamed. Moving a file from one place to another is implemented this way, as well.
However, if you don't want to do this, the SetEndOfFile function is another viable option to truncate the contents of a file in-place. From the documentation:
Sets the physical file size for the specified file to the current position of the file pointer.
The physical file size is also referred to as the end of the file. The SetEndOfFile function can be used to truncate or extend a file.
That wouldn't be called truncating; that term refers to removing data from the end, not the beginning. I'm not aware of any operating system where this is possible, other than by copying the contents of the file to a new file, starting at the 100th byte.
Deleting data that has been processed in a file is time consuming and in most cases not necessary.
Deleting data near the top or middle of the file requires writing a new file, which takes time and disk space. Most applications will read and process the entire file then rename the file (with a backup extension). This is useful for debugging purposes. Deleting an entire file is often a faster operation that writing a new file without processed data.
Deletions should only take place when necessary. For files, one can store an offset of where the valid data begins, thus reducing the need to delete data from a file. For secure purposes, overwriting data in the file is often faster then creating a new file without the processed data.
First try writing your program to not delete data in the file. Only delete as necessary, after the program is robust and working correctly. Many people would suggest to only delete files when there is no more space on drive.
Related
I have a working integration of FatFS in my C++ application running on a Cortex M4-based platform.
My application consists of logging data to a data format called MDF.
On the implementation side, I log data (to a given file) in batches of buffers; The number of buffers depends on how fast I acquire the data: log batch of one buffer . . . do other stuff . . . log batch of five buffer . . . do other stuff . . . etc.
There is also a header which is 24 bytes and contains the number of bytes of data. On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
This means that I have to call f_lseek before writing the header and then before I write the batch of data.
I am using f_cache_fptr so f_lseek is not painfully slow but I'd like to avoid needing to call f_lseek so frequently.
QUESTION
Is it possible to somehow have 2 seek locations so that I don't need to call f_seek to ping-pong between header-location and data-location?
I am open to modifying FatFS.
The problem, at the low-level, is simpler because the header only shares one 512 byte sector with the data: 24 bytes of header followed by 488 bytes of data.
Is it possible to somehow have 2 seek locations so that I don't need to call f_seek to ping-pong between header-location and data-location?
Not as far as I can tell, no, and it doesn't really seem to make sense. A FIL has only one current position, indicating where the next data written to it will go. What would it even mean for there to be two? How would the system know where to write? It certainly wouldn't be correct to write to both places.
Note in particular that with some operating systems and file systems, it is possible to open the same file more than once, but FatFS supports duplicate file opens only when all openings involved are for read-only mode.
I guess it would be possible to modify FatFS to give it the ability to store one file position when you seek to another, and then later to return to the first. So that would mean adding at least one member to the FIL structure, and adding at least one new function.
But why muck with the innards of FatFS? That's going to be at least a little risky. As long as you have to add a function anyway, how about just implementing a FRESULT my_f_write_at_beginning(FIL* fp, const void* buff, UINT btw, UINT* bw) on top of the existing functions? It can store the current position, seek to the beginning of the file, perform the write (maybe ensuring that the full number of bytes specified is written), and then seek back to the original position.
But fundamentally, no, there is no escaping ping-ponging back and forth, because doing so is part of the requirement you laid out.
On a PC, I would just save the header at the end of the measurement but this is an embedded product which could be de-powered at any point in time. If I don't save the header periodically, the file becomes "corrupted".
Therefore, in order to maintain coherency I need to re-save the header after saving every batch of data and that's where my issue is.
More correctly; you need to save the buffer and the header (footer?), and update the directory entry to reflect the new file size, and update the file allocation table to account for sectors allocated; and you need to write to at least 3 completely separate sectors "atomically" so that everything is consistent if there's a power failure at the wrong time.
This isn't entirely possible on most hardware.
However, there is a way to do it "somewhat safely". Specifically:
pre-allocate enough clusters for a completely new copy of the file (including the new data to append to the end) and update the file allocation table accordingly. If there's a power failure while doing this (or immediately after this point) the risk is lost clusters, which is an "ignore-able" problem that will waste some space but can be fixed easily with a typical "check disk" utility.
create a whole new copy of the file's data in the pre-allocated clusters (copy the old data, then append the new data and header). If there's a power failure in the middle of doing this (or immediately after this point), then the risk is the same as before - just some lost clusters (ignore-able).
atomically update the directory entry; changing both the file size and the "starting cluster number" with the same atomic (single sector) write. If there's a power failure after this point the risk is the same lost clusters (where the old version of the file's data was instead of where the new version of the file data is).
free the clusters that the old version of the file used by doing writes to the file allocation table. After this point you've completed successfully, so a power failure is fine.
To make this less awful for performance you can have two "cluster chains" and alternate between them; such that one chain of clusters is for the current version of the file and the other will become the next version of the file. This avoids the need to copy a lot of older data from one place to another (if you know the old data is still in previously used clusters). It could also avoid the need to allocate and free most clusters in the file allocation table, but only with a significant increase in the risk of lost clusters.
Of course for any of this to work you'd need a guarantee that single-sector writes are atomic; and you can't be using FAT12 (where an entry in the file allocation table can be split by a sector boundary).
I have a file written using gzwrite. Now i want to edit this file and insert some data in the middle by seeking. Is this possible with gzseek/gzwrite in cpp?
No, it isn't possible. You have to create a new file by successively writing the pieces.
So it is not much different from inserting data in the middle of an uncompressed file, except for one thing: with the uncompressed file, you could leave a hole of the right size (a series of spaces, for example) and later on overwrite that with the data to be inserted, but of course that is not possible with the compressed file because you cannot predict its compressed length.
I have an embedded Linux system, that stores data in a very large file, appending new data to the end. As the file size grows near filling available storage space, I need to remove oldest data.
Problem is, I can't really accept the disruption it would take to move the massive bulk of data "up" the file, like normal - lock the file for an extended period of time just to rewrite it (plus this being a flash medium, it would cause unnecessary wear to the flash).
Probably the easiest way would be to split the file into multiple smaller ones, but this has several downsides related to how the data is handled and processed - all the 'client end' software expects single file. OTOH it can handle 'corruption' of having the first record cut in half, so the file doesn't need to be trimmed at record offsets, just 'somewhere up there', e.g. first few iNodes freed. Oldest data is obsolete anyway so even more severe corruption of the beginning of the file is completely acceptable, as long as the 'tail' remains clean, and liberties can be taken with how much exactly is removed - 'roughly several first megabytes' is okay, no need for 'first 4096KB exactly' precision.
Is there some method, API, trick, hack to truncate beginning of file like that?
You can achieve the goal with Linux kernel v3.15 above for ext4/xfs file system.
int ret = fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, 4096);
See here
Truncating the first 100MB of a file in linux
The easiest solution for your old applications would be a FUSE filesystem which gives them access to the underlying file, but with the offset cyclically shifted. This would allow you to implement a ringbuffer at the physical level. The FUSE layer would be fairly trivial as it only needs to adjust all filepositions by a constant, modulo filesize.
What about setting up a separate process that renames the output file when it reaches a predefined size (for instance by adding the linux time at the end of the file name).
This would allow you to keep the old data and the main process will recreate the output file the next time it writes to it.
Another cron job may remove the old file every now and then.
I'm writing a simple calendar application which saves the data in a text file. I use the iCalendar-format, so my text file ends "END:VCALENDAR".
When the user adds a new event, the application should write the associated data at the end of the text file without overwriting "END:VCALENDAR", how can I do this? What about deleting an event which is saved in the middle of the text file? Is there a need to write the whole file again using the updated data? Many thanks.
You can't dynamically "expand" the file by writing in the middle of it.
You'll need to, either:
Deserialize the whole calendar to memory, then write it back (best option)
Read into memory everything which lies past the point you want to insert the data, write you data, then write the stored file "tail"
There isn't any way of inserting into the middle of a file; the underlying OS doesn't support it. The usual technique is to copy the file into a temporary file, making whatever modifications you need to along the line, then (and only if there are no errors on the output of the copy—do verify that the output stream has not failed after the close) delete the input file and rename/move the temporary file to the original name.
There is no method supported by the C++ libraries that, unlike append, gives an option to insert at any specific position into a file; be it a text or a binary file.
There are two options for you then:
First is the one you are presuming, that is, read the whole file, update the data and write it back again.
Second is to seek in the file to the last line's first character E as in END:VCALENDAR, write your event and then append "END:VCALENDAR" to it.
And yes, you can find that first character of last line, E right after the last newline character, programmatically.
Sorry, there isn't really any other way around, as far as I know.
I'm working on a simple database console application in C++ for adding, editing and deleting records in a .dat file. I have the addition and modification down, I'm just finding it hard to understand the concept of deletion in this scenario. Below is how I write a record.
Write record
fh.seekp(num*sizeof(customerObj),ios::beg); // Move the write pointer to where rec is
fh.write((char*)&customerObj,sizeof(customerObj)); // Write updated rec
Any ideas how instead of write() I could have something equivalent to delete()... or is it not that simple?
C and C++ don't have functions to delete parts of files. Many operating systems don't either.
Possible options:
If this is the last record, truncate the file. If not, move (=copy) all records after it, overwriting it, then truncate. Alternatively you could move (=copy) the last record to it and then truncate.
Create an extra file and copy to it all records before this and after this record. Then delete the old file and rename the new file.
Mark the record as unused. When writing new records check if you have any unused locations and use them first.
Use a file per record.
Introduce marker of deleted record. So instead of movement of large chunks of file you need write 1 symbol. When you need allocate new record you could iterate over already deleted and just remove marker.
Well, deleting from files is not trivial as you cannot delete a row from a file directly.
One approach is to read the entire file (or in bulks) and write it back without the required line. (quite robust and not efficient for large files).
Maybe if you divide you record file into smaller partitioned files than doing the above will be more efficient.
Another thing you can do is just mark a row in the file as invalid (as done in memory when deleting a pointer) and overwrite it when needed which of course depends on how you write your records but I hope you get what I mean.