C++ fstream: erase the file contents from a selected point

I need to erase the file contents from a selected point (C++ fstream). Which function should I use?
I have written objects to the file, and I need to delete some of those objects from the middle of the file.

C++ streams have no mechanism to truncate a file at a given point. You either have to recreate the file (open with ios::trunc and write the contents you want to keep), use std::filesystem::resize_file (C++17), or use OS-specific API calls (SetEndOfFile on Windows, truncate or ftruncate on Unix).
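For example, here is a minimal sketch that keeps the first N bytes of a file and discards the rest, assuming C++17 is available; the file name and offset are placeholders. On older toolchains, truncate()/ftruncate() or SetEndOfFile do the same job.

#include <cstdint>
#include <filesystem>

int main() {
    const char* path = "objects.dat";        // hypothetical data file
    const std::uintmax_t keep = 4096;        // byte offset to cut at (assumption)

    // Everything from byte `keep` onward is discarded.
    std::filesystem::resize_file(path, keep);
}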
EDIT: Deleting stuff in the middle of a file is an exceedingly precarious business. Long before considering any other alternatives, I would try to use a server-less database engine like SQLite to store serialised objects. Better still, I would use SQLite as intended by storing the data needed by those objects in a proper schema.
EDIT 2: If the problem statement requires raw file access...
As a general rule, you don't delete data from the middle of a file. If the objects can be serialised to a fixed size on disk, you can work with them as records, and rather than trying to delete data, you use a table that indexes records within the file. E.g., if you write four records in sequence, the table will hold [0, 1, 2, 3]. To delete the second record, you simply remove its entry from the table: [0, 2, 3]. There are at least two ways to reuse the holes this leaves behind in the file:
On each insertion, scan for the first unused index and write the object out at the corresponding record location. This will become more expensive, though, as the file grows.
Maintain a free list. Store, as a separate variable, the index of the most recently freed record. In the space occupied by that record encode the index of the record freed before it, and so on. This maintains a handy linked list of free records while requiring space for only one additional number. It is more complicated to work with, however, and requires an extra disk I/O when deleting and inserting.
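A minimal sketch of the free-list idea, assuming fixed 64-byte records; the record size and the in-memory free-list head are placeholders (in practice the head would also be persisted somewhere in the file):

#include <cstdint>
#include <fstream>

constexpr std::streamoff kRecordSize = 64;   // assumption: fixed-size records
constexpr std::int64_t kNoFree = -1;

std::int64_t free_head = kNoFree;            // index of most recently freed record

// Mark record `idx` as free: store the previous free-list head inside it.
void free_record(std::fstream& f, std::int64_t idx) {
    f.seekp(idx * kRecordSize);
    f.write(reinterpret_cast<const char*>(&free_head), sizeof(free_head));
    free_head = idx;
}

// Reuse a freed slot if one exists, otherwise append; returns the record index.
std::int64_t allocate_record(std::fstream& f) {
    if (free_head == kNoFree) {
        f.seekp(0, std::ios::end);
        std::streamoff end = f.tellp();
        return end / kRecordSize;            // append a new record
    }
    std::int64_t idx = free_head;
    f.seekg(idx * kRecordSize);
    f.read(reinterpret_cast<char*>(&free_head), sizeof(free_head));  // pop the list
    return idx;
}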
If the objects can't be serialised to a fixed-length, then this approach becomes much, much harder. Variable-length record management code is very complex.
Finally, if the problem statement requires keeping records in order on disk, then it's a stupid problem statement, because insertion/removal in the middle of a file is ridiculously expensive; no sane design would require this.

The general method is to open the file for read access, open a new file for write access, read the content of the first file and write the data you want retained to the second file. When complete, you delete the first file and rename the second to that of the first.
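A minimal sketch of that rewrite-and-rename approach, assuming the bytes to drop live at a known offset and length (the file names, offset and length here are placeholders):

#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("data.bin", std::ios::binary);
    std::ofstream out("data.tmp", std::ios::binary);
    const std::streamoff skip_at = 1024, skip_len = 256;   // assumption: region to delete

    std::vector<char> head(skip_at);
    in.read(head.data(), head.size());
    out.write(head.data(), in.gcount());                   // keep everything before the hole

    in.seekg(skip_at + skip_len);                          // jump over the deleted region
    out << in.rdbuf();                                     // copy the remainder

    in.close();
    out.close();
    std::remove("data.bin");
    std::rename("data.tmp", "data.bin");
}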

Related

Recover object lazy-loading the containing file

I'm using a binary file to recover an object with boost::archive::binary_iarchive, but the file is huge (18 GB) and deserialising the object loads the entire file into memory. Is there a way to read the file in parts (a lazy load) to avoid the memory use?
What I have:
#include <fstream>
#include <boost/archive/binary_iarchive.hpp>

std::ifstream ifs(filename, std::ios::binary);
boost::archive::binary_iarchive ia(ifs);
MyObject obj;
ia >> obj;
Upgrading my comment to an answer:
@cmaster got really close to an approach that can work, but he accidentally put the problem upside down.
The raw file was never the issue (it was streaming all along).
The problem is that deserialization tries to put all the data in memory (e.g. into that vector). So the only real solutions would be to either:
put this data into a (shared?) memory map. You can use the allocators from Boost Interprocess to help you achieve this. This is a lot of effort, but relatively straightforward, conceptually; or
modify the deserialization code to convert to a different on-disk format on the fly (instead of inserting into e.g. that vector), which would then allow mmap as cmaster suggested.
In other words, you'd "cannibalize" the Boost Serialization implementation to migrate the data away from Boost Serialization towards a raw binary format that can be used directly in mapped memory.
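A minimal sketch of the first option using Boost.Interprocess; the file name, segment size and the choice of a vector of doubles are all placeholder assumptions:

#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>

namespace bip = boost::interprocess;
using shm_alloc  = bip::allocator<double, bip::managed_mapped_file::segment_manager>;
using shm_vector = bip::vector<double, shm_alloc>;

int main() {
    // Create (or reopen) a file-backed segment; the size is an assumption.
    bip::managed_mapped_file mfile(bip::open_or_create, "data.bin", 64 * 1024 * 1024);

    // The vector lives inside the mapped file, not in the heap.
    shm_vector* v = mfile.find_or_construct<shm_vector>("samples")(mfile.get_segment_manager());
    v->push_back(42.0);
}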
You can use mmap() to map the file into your address space. With that, it doesn't matter that the file is too large because the kernel knows that any data in the mapped region is just a copy of the file on the hard disk. Consequently, it does not even need to swap the data out when it needs the memory for something else. The kernel will just lazily load the parts of the file that you need as you touch them, which is especially good if you don't need everything in the file.
The nice thing about mmap() is that you have the entire file contents accessible as a huge char array, which is quite convenient for many use cases. The only precondition that must be met is that your process runs as a 64 bit process, otherwise your virtual address space will be too small to fit the file into it.
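A minimal read-only sketch using mmap, assuming a POSIX system (the file name is a placeholder):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("huge.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // The kernel pages data in lazily as it is touched; nothing is read up front.
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // data[0 .. st.st_size) is now addressable like a huge char array.
    std::printf("first byte: %d\n", data[0]);

    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
}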

Why must files be entirely rewritten when there are arrays in C++?

From what I've read on the internet and in books, when editing a text file, or any file for that matter, the file must be entirely rewritten; you can't just insert data into a file and save it.
If this is so, how can there be arrays in programming languages? In C++ I can explicitly modify certain values in the middle of arrays. To me, this seems like a demonstration of the modification of one or two bytes in the middle of a group of bytes.
The only two possible explanations I have thought of are:
There is some funky stuff going on behind the scenes in C++, so it seems like only one or two bytes are being modified, but the array is actually entirely rewritten.
Then, after thinking about it, and especially after typing it out, I realized that the aforementioned explanation seems really, really dumb and totally untrue, because there are things like addresses and pointers, and the performance would be awful. So I thought that maybe files are not entirely rewritten; only everything after the point where new data was inserted is rewritten. This seems much more plausible to me, and makes sense.
What is the difference between writing data to a file and writing data to an array?
You can change the values of particular locations in an array without needing to re-write the whole thing. However, you cannot insert new values into the middle of an array without shifting everything following the new values in order to make room.
Similarly, you can overwrite sections of a file without copying it (although the underlying storage technology may need to re-write an entire storage block in order to change a single byte within that block), but you can't insert new data without somehow making room for it. Text editors (and editors for more complicated file formats) are designed for random-access modifications that do not preserve length, so they will typically re-write the entire file regardless of what changed rather than trying to optimize for length-preserving edits.
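A minimal sketch of overwriting bytes in place without touching the rest of the file; it assumes the file already exists, and the offset and bytes written are placeholders:

#include <fstream>

int main() {
    // Open for reading and writing without truncating the existing contents.
    std::fstream f("data.bin", std::ios::in | std::ios::out | std::ios::binary);

    f.seekp(128);                        // jump to a byte offset in the middle (assumption)
    const char patch[] = {'X', 'Y'};
    f.write(patch, sizeof(patch));       // overwrites 2 bytes; everything else is untouched
}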

Remove bytes in the middle of a file without moving the end?

For example, if I have lots of data entries stored in a file, each with a different size, and I have 1000 entries which make the file about 100 MB large: if I then wanted to remove a 50 KB entry from the middle of the file, how can I remove those 50 KB of bytes without moving all the bytes after them up to fill the gap?
I am using winapi functions such as these for the file management:
CreateFile, WriteFile, ReadFile and SetFilePointerEx
If you really want to do that, keep a flag in each entry. When you want to remove an entry from your file, simply clear that flag (a logical removal) without deleting the entry physically. The next time you add an entry, go through the file, look for the first invalidated entry, and overwrite it. If all entries are still valid, append the new one to the end. This takes O(1) time to remove an entry and O(n) to add a new entry, assuming that reading/writing a single entry from/to disk is the basic operation.
You can even optimize it further. At the beginning of the file, store a bit map(1 for invalidated). E.g., 0001000... represents that the 4th entry in your file is invalidated. When you add an entry, search for the first 1 in the bit map and use Random file I/O (in contrast with sequential file I/O) to redirect the file pointer to that entry-to-overwrite directly. Adding in this way only takes O(1) time.
Oh, I notice your comment. If you want to do it efficiently with entry removed physically, a simple way is to swap the entry-to-remove with the very last one in your file and remove the last one, assuming your entries are not sorted. The time is also good, which is O(1) for both adding and removing.
Edit: Just as Joe mentioned, this requires that all of your entries have the same size. You can implement one with variable length of entries, but that'll be more complicated than the one in discussion here.
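A minimal sketch of the logical-removal idea, assuming fixed 64-byte entries whose first byte is a "valid" flag; the Entry layout is a placeholder. It uses fstream for brevity, but the same pattern works with ReadFile/WriteFile:

#include <cstdint>
#include <fstream>

#pragma pack(push, 1)
struct Entry {
    std::uint8_t valid;      // 1 = live, 0 = logically removed
    char payload[63];        // assumption: fixed 64-byte entries
};
#pragma pack(pop)

void remove_entry(std::fstream& f, std::int64_t index) {
    std::uint8_t zero = 0;
    f.seekp(index * static_cast<std::streamoff>(sizeof(Entry)));
    f.write(reinterpret_cast<const char*>(&zero), 1);    // just clear the flag
}

std::int64_t add_entry(std::fstream& f, const Entry& e) {
    // Scan for the first invalidated slot and overwrite it; otherwise append.
    Entry probe;
    std::int64_t index = 0;
    f.seekg(0);
    while (f.read(reinterpret_cast<char*>(&probe), sizeof(probe))) {
        if (!probe.valid) break;
        ++index;
    }
    f.clear();                                            // clear EOF state if we hit the end
    f.seekp(index * static_cast<std::streamoff>(sizeof(Entry)));
    f.write(reinterpret_cast<const char*>(&e), sizeof(e));
    return index;
}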
Let A = start of file, B = start of block to remove, C = end of block to remove
CreateFile with flag FILE_FLAG_RANDOM_ACCESS
SetFilePointerEx to position C, then read from there to EOF into a buffer (this may be a large buffer given your file size; be careful with gigantic tails, since the move requires allocating a buffer of roughly that size just to shuffle the data).
Write the buffer back to the file at position B.
The file pointer is now at B + (EOF - C), i.e. just past the shifted tail. Call SetEndOfFile to truncate the file at that position, then close it.
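Put together, a minimal sketch of those steps; error handling is mostly omitted, the tail is assumed to fit in memory, and the path is a placeholder:

#include <windows.h>
#include <vector>

// Remove the bytes [B, C) from a file by shifting the tail up and truncating.
bool RemoveRange(const wchar_t* path, LONGLONG B, LONGLONG C)
{
    HANDLE h = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                           OPEN_EXISTING, FILE_FLAG_RANDOM_ACCESS, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;

    LARGE_INTEGER size;
    GetFileSizeEx(h, &size);

    // Read everything after the block being removed.
    std::vector<char> tail(static_cast<size_t>(size.QuadPart - C));
    LARGE_INTEGER pos; pos.QuadPart = C;
    SetFilePointerEx(h, pos, nullptr, FILE_BEGIN);
    DWORD read = 0;
    ReadFile(h, tail.data(), static_cast<DWORD>(tail.size()), &read, nullptr);

    // Write it back starting at B, overwriting the removed block.
    pos.QuadPart = B;
    SetFilePointerEx(h, pos, nullptr, FILE_BEGIN);
    DWORD written = 0;
    WriteFile(h, tail.data(), read, &written, nullptr);

    // The file pointer is now just past the shifted tail; cut the file there.
    SetEndOfFile(h);
    CloseHandle(h);
    return true;
}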
Note that this could be done way easier with the memmove function. However this requires you to map the entire file into memory, make the move, and write it back out. This is great for small files, but files larger than 50-100MB I would caution you about having enough available contiguous virtual address space.
You can simply keep flagging the unused space, and after some time, when the internal fragmentation exceeds a certain ratio, run a routine that compacts the file. With this scheme removals are fast, but some periodic reorganisation is needed. If you have a separate file-handling layer, you can divide the file into chunks, keep track of the free chunks, mark a chunk as unused when deleting, and reuse it on a later insertion. This scheme will depend on the type of records in your file: fixed or variable length.

Delete a line in an ofstream in C++

I want to erase lines within a file. I know you can store the content of the file (in a vector for example), erase the line and write again. However, it feels very cumbersome, and not very efficient if the file gets bigger.
Anyone knows of a better, more efficient, more elegant way of doing it?
On most file-systems, this is the only option you have, short of switching to an actual database.
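For reference, a minimal sketch of that read-erase-rewrite approach; the file name and line index are placeholders:

#include <fstream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> lines;
    {
        std::ifstream in("notes.txt");
        for (std::string line; std::getline(in, line); )
            lines.push_back(line);
    }

    const std::size_t line_to_erase = 2;              // assumption: zero-based line index
    if (line_to_erase < lines.size())
        lines.erase(lines.begin() + line_to_erase);

    std::ofstream out("notes.txt", std::ios::trunc);  // rewrite the whole file
    for (const auto& line : lines)
        out << line << '\n';
}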
However, if you find yourself in this situation (i.e. very large files, with inserts/deletes in the middle), consider whether you can do something like maintaining a bitmap at the top of the file, where each bit represents one line of your file. To "delete" a line, simply flip the corresponding bit value.
There's nothing particularly magical about disk files. They still like to store their data in contiguous areas (typically called something like "blocks"). They don't have ways of leaving data-free holes in the middle of these areas. So if you want to "remove" three bytes from the middle of one of these areas, something somewhere is going to have to accomplish this by moving everything else in that area back by three bytes. No, it is not efficient.
This is why text editors (which have to do this kind of thing a lot) tend to load as much of the file as possible (if not all of it) into RAM, where moving data around is much faster. They typically only write changes back to disk when requested (or periodically). If you are going to have to make lots of changes like this, I'd suggest taking a page from their book and doing something similar.
The BerkeleyDB library (dbopen(3)) has an access method called DB_RECNO. It lets you manipulate a file of arbitrary length as a sequence of records separated by any delimiter you choose; the default is variable-length records delimited by Unix newlines. You then access each "record" by an integer index, so you can delete arbitrary lines from your text file. This isn't specific to C++, but on most Unix/Linux systems this API is already available to you.
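A minimal sketch of deleting one line through the classic dbopen interface; the header name and link flags vary by system (e.g. db.h vs db_185.h, -ldb), so treat the details as assumptions:

#include <db.h>          // on some systems this is <db_185.h>
#include <fcntl.h>
#include <cstring>

int main() {
    // DB_RECNO with a null RECNOINFO treats the file as newline-delimited records.
    DB* db = dbopen("notes.txt", O_RDWR, 0644, DB_RECNO, nullptr);
    if (!db) return 1;

    recno_t recno = 3;               // record numbers are 1-based
    DBT key;
    std::memset(&key, 0, sizeof(key));
    key.data = &recno;
    key.size = sizeof(recno);

    db->del(db, &key, 0);            // delete line 3
    db->sync(db, 0);
    db->close(db);
}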

How to store a hash table in a file?

How can I store a hash table with separate chaining in a file on disk?
Generating the data stored in the hash table at runtime is expensive, it would be faster to just load the HT from disk...if only I can figure out how to do it.
Edit:
The lookups are done with the HT loaded in memory. I need to find a way to store the hashtable (in memory) to a file in some binary format. So that next time when the program runs it can just load the HT off disk into RAM.
I am using C++.
What language are you using? The common method is to do some sort of binary serialization.
OK, I see you have edited the question to add the language. For C++ there are a few options. I believe the Boost serialization mechanism is pretty good. In addition, the page for Boost's serialization library also describes alternatives. Here is the link:
http://www.boost.org/doc/libs/1_37_0/libs/serialization/doc/index.html
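A minimal sketch using Boost Serialization, assuming the table can be represented as a std::unordered_map (a custom chained table would instead need its own serialize() method); the file name and key/value types are placeholders:

#include <fstream>
#include <string>
#include <unordered_map>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/serialization/unordered_map.hpp>
#include <boost/serialization/string.hpp>

int main() {
    std::unordered_map<std::string, int> table = {{"alpha", 1}, {"beta", 2}};

    {   // save the table to disk
        std::ofstream ofs("table.bin", std::ios::binary);
        boost::archive::binary_oarchive oa(ofs);
        oa << table;
    }

    std::unordered_map<std::string, int> loaded;
    {   // load it back on the next run
        std::ifstream ifs("table.bin", std::ios::binary);
        boost::archive::binary_iarchive ia(ifs);
        ia >> loaded;
    }
}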
Assuming C/C++: Use array indexes and fixed size structs instead of pointers and variable length allocations. You should be able to directly write() the data structures to file for later read()ing.
For anything higher-level: a lot of higher-level language APIs have serialization facilities. Java and Qt/C++ both have methods that spring immediately to mind, so I know others do as well.
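A minimal sketch of the fixed-size-struct suggestion above: chains are expressed as array indices rather than pointers, so the whole table can be dumped and reloaded in one call. The sizes and the Node layout are placeholder assumptions:

#include <cstdint>
#include <cstdio>

constexpr int kBuckets  = 1024;
constexpr int kMaxNodes = 4096;

struct Node {                         // fixed-size, no pointers: safe to dump to disk
    char         key[32];
    std::int32_t value;
    std::int32_t next;                // index of next node in the chain, -1 = end
};

struct Table {
    std::int32_t buckets[kBuckets];   // index of first node per bucket, -1 = empty
    std::int32_t node_count;
    Node         nodes[kMaxNodes];
};

bool save(const Table& t, const char* path) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    bool ok = std::fwrite(&t, sizeof(t), 1, f) == 1;
    std::fclose(f);
    return ok;
}

bool load(Table& t, const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    bool ok = std::fread(&t, sizeof(t), 1, f) == 1;
    std::fclose(f);
    return ok;
}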
You could just write the entire data structure directly to disk by using serialization (e.g. in Java). However, you might be forced to read the entire object back into memory in order to access its elements. If this is not practical, then you could consider using a random access file to store the elements of the hash table. Instead of using a pointer to represent the next element in the chain, you would just use the byte position in the file.
Ditch the pointers for indices.
This is a bit similar to constructing an on-disk DAWG, which I did a while back. What made that so very sweet was that it could be loaded directly with mmap instead of reading the file. If the hash space is manageable, say 2^16 or 2^24 entries, then I think I would do something like this:
Keep a list of free indices. (if the table is empty, each chain-index would point at the next index.)
When chaining is needed use the free space in the table.
If you need to put something in an index that's occupied by a squatter (overflow from elsewhere):
record the index (let's call it N)
swap the new element and the squatter
put the squatter in a new free index (F).
follow the chain on the squatter's hash index, to replace N with F.
If you completely run out of free indices, you probably need a bigger table, but you can cope a little longer by using mremap to create extra room after the table.
This should allow you to mmap and use the table directly, without modification (scary fast if it's in the OS cache!), but you have to work with indices instead of pointers. It's pretty spooky to have megabytes available in syscall-round-trip time, and still have it take up less than that in physical memory, because of paging.
Perhaps DBM could be of use to you.
If your hash table implementation is any good, then just store the hash and each object's data - putting an object into the table shouldn't be expensive given the hash, and not serialising the table or chain directly lets you vary the exact implementation between save and load.