Can you read and delete a chunk of a file? - c++

I am using ofstream and ifstream to read a chunk from a file, post it over middleware (DDS) to another process, and have the other process write that chunk of the file.
Basically transferring a file. The two components are unaware of each other and may live on the same hardware or on different hardware (DDS takes care of the transfer either way).
This is all working; however, when I try to do this with a large file (>500 MB) and the destination component is on the same board, I run out of RAM (since 500 MB x 2 = 1 GB, which is my limit).
So I am thinking of reading a chunk from the file, deleting that chunk from the file, and then sending the chunk, so I end up with:
A B
12345 ->
2345 -> 1
345 -> 12
45 -> 123
5 -> 1234
-> 12345
Where each number is a chunk of a file.
I am using Linux, so I can use any Linux APIs directly, but would probably prefer a pure C++ approach. I can't really see any good options here. i/ostream does not appear to let you do this. Options like sed will (I think) end up using more memory by copying.
Are there any better mechanisms for doing this?
Update
The files are stored in RAM via a tmpfs partition

There is no standard mechanism in C++ or Linux for shortening a file in-place by deleting data from the beginning or middle. Most file systems don't work in a way that would support it. When one wants to delete data from such a position, one has to make a new copy of the file, omitting the data that are to be deleted.
You can shorten a file by removing a tail, but that does not serve your purpose unless possibly if you send chunks in reverse order, from tail to head. However, the most natural ways I can think of to support that in an application such as yours would involve pre-allocating the full-size destination file, and that would have the same problem you are trying to solve.
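A minimal sketch of that reverse-order idea, assuming POSIX file semantics and a hypothetical send_chunk() standing in for the DDS publish call:

#include <algorithm>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <vector>

void send_chunk(const char* data, std::size_t len);  // hypothetical DDS publish

void drain_file_backwards(const std::filesystem::path& path,
                          std::size_t chunk_size) {
    std::uintmax_t remaining = std::filesystem::file_size(path);
    std::vector<char> buf;
    while (remaining > 0) {
        auto n = static_cast<std::size_t>(
            std::min<std::uintmax_t>(chunk_size, remaining));
        buf.resize(n);
        {
            std::ifstream in(path, std::ios::binary);
            in.seekg(static_cast<std::streamoff>(remaining - n));  // last unsent chunk
            in.read(buf.data(), static_cast<std::streamsize>(n));
        }                                   // stream closed before truncation
        send_chunk(buf.data(), n);          // tail chunk is sent first
        remaining -= n;
        std::filesystem::resize_file(path, remaining);  // give the RAM back to tmpfs
    }
}

The receiver would then have to assemble the chunks back-to-front, which, as noted above, tends to push you toward pre-allocating the destination file and thus reintroduces the original memory problem unless the destination is on different hardware.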

Can you read ... chunk of a file?
You can seek to an offset to start reading from anywhere in a file, and you can stop reading once you've read the chunk entirely.
Can you ... delete chunk of a file?
Operating systems present files with an interface similar to a vector. Deleting a chunk from the end of a file is trivial, just as erasing elements from the end of a vector is trivial. You can simply call std::filesystem::resize_file with the new size, which is the chunk size subtracted from the original size.
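For example, a one-call sketch of trimming a chunk off the end with C++17's std::filesystem:

#include <cstdint>
#include <filesystem>

// Drop up to chunk_size bytes from the end of the file.
void drop_tail(const std::filesystem::path& p, std::uintmax_t chunk_size) {
    std::uintmax_t size = std::filesystem::file_size(p);
    std::filesystem::resize_file(p, size > chunk_size ? size - chunk_size : 0);
}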
Deleting a chunk from elsewhere is more complex. You must first seek to the beginning of the chunk, then copy the content from after the chunk into the start of the chunk and onward. The complexity of this operation is linear in the distance from the start of the chunk to the end of the file. Once you have copied all of the trailing content, you can resize the file to remove the excess.
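A sketch of that copy-forward-then-truncate sequence, assuming C++17 and omitting error handling:

#include <cstdint>
#include <filesystem>
#include <fstream>
#include <vector>

// Erase len bytes starting at offset off by sliding the tail forward,
// then truncating; linear in the size of the tail.
void erase_range(const std::filesystem::path& p,
                 std::uintmax_t off, std::uintmax_t len) {
    std::fstream f(p, std::ios::in | std::ios::out | std::ios::binary);
    std::vector<char> buf(1 << 16);
    std::uintmax_t rd = off + len, wr = off;
    for (;;) {
        f.seekg(static_cast<std::streamoff>(rd));
        f.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize n = f.gcount();
        if (n <= 0) break;
        f.clear();                         // clear eofbit so seekp/write still work
        f.seekp(static_cast<std::streamoff>(wr));
        f.write(buf.data(), n);
        rd += static_cast<std::uintmax_t>(n);
        wr += static_cast<std::uintmax_t>(n);
    }
    f.close();
    std::filesystem::resize_file(p, wr);   // drop the now-duplicated excess
}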

Related

Recover object lazy-loading the containing file

I'm using a binary file to recover an object with boost::archive::binary_iarchive, but the file is too heavy (18 GB) and recovering the object loads the entire file into memory. Is there a way to read the file in parts (a lazy load) to avoid the memory use?
What I have:
#include <fstream>
#include <boost/archive/binary_iarchive.hpp>

std::ifstream ifs(filename, std::ios::binary);
boost::archive::binary_iarchive ia(ifs);
MyObject obj;
ia >> obj;
Upgrading my comment to an answer:
@cmaster got really close to an approach that can work, but he accidentally put the problem upside down.
The raw file was never the issue (it was streaming all along).
The problem is that deserialization tries to put all the data in memory (into that vector, for example). So the only real solutions are to either:
put this data into a (shared?) memory map. You can use the allocators from Boost Interprocess to help you achieve this. This is a lot of effort, but relatively straightforward, conceptually; or
modify the deserialization code to convert to a different on-disk format on the fly (instead of inserting into, e.g., that vector), which would then allow mmap as cmaster suggested.
In other words, you'd "cannibalize" the Boost Serialization implementation to migrate the data away from Boost Serialization toward a raw binary format that affords using it directly in mapped memory.
You can use mmap() to map the file into your address space. With that, it doesn't matter that the file is too large because the kernel knows that any data in the mapped region is just a copy of the file on the hard disk. Consequently, it does not even need to swap the data out when it needs the memory for something else. The kernel will just lazily load the parts of the file that you need as you touch them, which is especially good if you don't need everything in the file.
The nice thing about mmap() is that you have the entire file contents accessible as a huge char array, which is quite convenient for many use cases. The only precondition that must be met is that your process runs as a 64 bit process, otherwise your virtual address space will be too small to fit the file into it.
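A minimal sketch of that approach, assuming POSIX and trimming error handling:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file read-only; the kernel pages it in lazily as you touch it.
const char* map_file(const char* path, std::size_t& size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    fstat(fd, &st);
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                             // the mapping outlives the descriptor
    if (p == MAP_FAILED) return nullptr;
    size_out = static_cast<std::size_t>(st.st_size);
    return static_cast<const char*>(p);    // munmap(ptr, size) when done
}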

Remove bytes in the middle of a file without moving the end?

For example, if I have lots of data entries stored in a file, each with a different size, and I have 1000 entries making the file around 100 MB large: if I then wanted to remove an entry of 50 KB in the middle of the file, how can I remove those 50 KB of bytes from the file without moving all the trailing bytes up to fill the gap?
I am using winapi functions such as these for the file management:
CreateFile, WriteFile, ReadFile and SetFilePointerEx
If you really want to do that, set a flag in each entry. When you want to remove an entry from your file, simply invalidate that flag (logical removal) without deleting it physically. The next time you add an entry, go through the file, look for the first invalidated entry, and overwrite it. If all entries are valid, append the new one to the end. This takes O(1) time to remove an entry and O(n) to add a new one, assuming that reading/writing a single entry from/to disk is the basic operation.
You can even optimize it further: at the beginning of the file, store a bitmap (1 for invalidated). E.g., 0001000... represents that the 4th entry in your file is invalidated. When you add an entry, search for the first 1 in the bitmap and use random file I/O (in contrast with sequential file I/O) to point the file pointer at that entry-to-overwrite directly. Adding this way only takes O(1) time.
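A rough sketch of that bitmap scheme, assuming fixed-size records and a hypothetical layout of [bitmap][record 0][record 1]...:

#include <cstddef>
#include <fstream>
#include <vector>

constexpr std::size_t MAX_RECORDS = 1024;            // assumed fixed capacity
constexpr std::size_t BITMAP_BYTES = MAX_RECORDS / 8;

// Logical removal: flip the record's bit to 1 (invalidated).
void mark_deleted(std::fstream& f, std::size_t index) {
    char byte;
    f.seekg(static_cast<std::streamoff>(index / 8));
    f.read(&byte, 1);
    byte |= static_cast<char>(1 << (index % 8));
    f.seekp(static_cast<std::streamoff>(index / 8));
    f.write(&byte, 1);
}

// Scan the bitmap for a reusable slot; MAX_RECORDS means "append at the end".
std::size_t find_free_slot(std::fstream& f) {
    std::vector<char> bm(BITMAP_BYTES);
    f.seekg(0);
    f.read(bm.data(), static_cast<std::streamsize>(bm.size()));
    for (std::size_t i = 0; i < MAX_RECORDS; ++i)
        if (bm[i / 8] & (1 << (i % 8))) return i;
    return MAX_RECORDS;
}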
Oh, I notice your comment. If you want to do it efficiently with entries removed physically, a simple way is to swap the entry-to-remove with the very last one in your file and then remove the last one, assuming your entries are not sorted. The time is also good: O(1) for both adding and removing.
Edit: Just as Joe mentioned, this requires that all of your entries have the same size. You can implement one with variable length of entries, but that'll be more complicated than the one in discussion here.
Let A = start of file, B = start of block to remove, C = end of block to remove
CreateFile with flag FILE_FLAG_RANDOM_ACCESS
SetFilePointerEx to position C, read to EOF into a buffer (this may be a large buffer given your file size; be careful with gigantic records, because any file I/O operation now has to allocate a record-sized buffer to do even a simple operation such as a move).
Copy buffer to position B in file
The file pointer should now be at position B + sizeof(block from C to EOF). Call SetEndOfFile to truncate the file at that position, then close.
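Putting those steps together, a hedged Win32 sketch (error checks omitted; assumes the C-to-EOF tail fits in one buffer):

#include <windows.h>
#include <vector>

void remove_block(const wchar_t* path, LONGLONG B, LONGLONG C) {
    HANDLE h = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr,
                           OPEN_EXISTING, FILE_FLAG_RANDOM_ACCESS, nullptr);
    LARGE_INTEGER size, pos;
    GetFileSizeEx(h, &size);

    std::vector<char> tail(static_cast<size_t>(size.QuadPart - C));
    DWORD n = 0;
    pos.QuadPart = C;
    SetFilePointerEx(h, pos, nullptr, FILE_BEGIN);   // jump to C
    ReadFile(h, tail.data(), static_cast<DWORD>(tail.size()), &n, nullptr);

    pos.QuadPart = B;
    SetFilePointerEx(h, pos, nullptr, FILE_BEGIN);   // rewind to B
    WriteFile(h, tail.data(), n, &n, nullptr);       // copy the tail over the block

    SetEndOfFile(h);                                 // truncate at B + tail size
    CloseHandle(h);
}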
Note that this could be done far more easily with the memmove function. However, that requires you to map the entire file into memory, make the move, and write it back out. This is great for small files, but for files larger than 50-100 MB I would caution you about having enough contiguous virtual address space available.
You can simply keep flagging the unused space and, after some time, when the internal fragmentation exceeds a certain ratio, run a routine that compacts the file. With this scheme the removals are fast, but some periodic reorganization is needed. If you have a separate file-handling scheme, you can divide the file into chunks, keep track of the free chunks, mark a chunk as unused when deleting, and reuse it on a later insertion. This scheme depends on the type of records in your file: fixed or variable length.

How to delete parts from a binary file in C++

I would like to delete parts from a binary file, using C++. The binary file is about 5-10 MB.
What I would like to do:
Search for an ANSI string "something"
Once I have found this string, I would like to delete the following n bytes, for example the following 1 MB of data. I would like to delete those characters, not fill them with NULs, and thus make the file smaller.
I would like to save the modified file into a new binary file, which is the same as the original file except for the n bytes I have deleted.
Can you give me some advice / best practices how to do this the most efficiently? Should I load the file into memory first?
How can I search efficiently for an ANSI string? I mean, possibly I have to skip a few megabytes of data before I find that string. >> I have been told I should ask this in another question, so it's here:
How to look for an ANSI string in a binary file?
How can I delete n bytes and write it out to a new file efficiently?
OK, I don't need it to be super efficient; the file will not be bigger than 10 MB and it's OK if it runs for a few seconds.
There are a number of fast string search routines that perform much better than testing each and every character. For example, when trying to find "something", only every 9th character needs to be tested.
Here's an example I wrote for an earlier question: code review: finding </body> tag reverse search on a non-null terminated char str
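If C++17 is available, the standard library offers the same skip-ahead behavior via std::boyer_moore_searcher; a sketch over a raw byte buffer:

#include <algorithm>
#include <cstddef>
#include <functional>
#include <string_view>
#include <vector>

// Returns the offset of the first match, or SIZE_MAX if not found.
std::size_t find_pattern(const std::vector<char>& data,
                         std::string_view needle) {
    auto it = std::search(data.begin(), data.end(),
                          std::boyer_moore_searcher(needle.begin(),
                                                    needle.end()));
    return it == data.end() ? static_cast<std::size_t>(-1)
                            : static_cast<std::size_t>(it - data.begin());
}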
For a 5-10 MB file I would have a look at writev() if your system supports it. Read the entire file into memory, since it is small enough. Scan for the bytes you want to drop. Pass writev() the list of iovecs (which will just be pointers into your read buffer plus lengths), and then you can rewrite the entire modified contents in a single system call.
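A sketch of that idea, assuming the whole file is already in buf and a single contiguous range [drop_off, drop_off + drop_len) is being dropped:

#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>
#include <vector>

bool write_without_range(const char* out_path, std::vector<char>& buf,
                         size_t drop_off, size_t drop_len) {
    int fd = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    iovec iov[2] = {
        { buf.data(), drop_off },                    // bytes before the hole
        { buf.data() + drop_off + drop_len,
          buf.size() - drop_off - drop_len },        // bytes after the hole
    };
    ssize_t n = writev(fd, iov, 2);                  // one system call
    close(fd);
    return n == static_cast<ssize_t>(buf.size() - drop_len);
}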
First, if I understand your meaning in your "How can I search efficiently" subsection, you cannot just skip a few megabytes of data in the search if the target string might be in those first few megabytes.
As for loading the file into memory, if you do that, don't forget to make sure you have enough space in memory for the entire file. You will be frustrated if you go to use your utility and find that the 2GB file you want to use it on can't fit in the 1.5GB of memory you have left.
I am going to assume you will load into memory or memory map it for the following.
You did specifically say this was a binary file, so this means that you cannot use the normal C++ string searching/matching, as the null characters in the file's data will confuse it (end it prematurely without a match). You might instead be able to use memchr to find the first occurrence of the first byte in your target, and memcmp to compare the next few bytes with the bytes in the target; keep using memchr/memcmp pairs to scan through the entire thing until found. This is not the most efficient way, as there are better pattern-matching algorithms, but this is a sort of efficient way, I suppose.
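A sketch of that memchr/memcmp scan:

#include <cstddef>
#include <cstring>

const char* find_bytes(const char* hay, std::size_t hay_len,
                       const char* needle, std::size_t needle_len) {
    if (needle_len == 0 || needle_len > hay_len) return nullptr;
    const char* p = hay;
    const char* end = hay + hay_len;
    while (needle_len <= static_cast<std::size_t>(end - p)) {
        // memchr narrows the candidates to positions matching the first byte.
        p = static_cast<const char*>(
            std::memchr(p, needle[0], end - p - needle_len + 1));
        if (!p) return nullptr;
        if (std::memcmp(p, needle, needle_len) == 0) return p;
        ++p;
    }
    return nullptr;
}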
To "delete" n bytes you have to actually move the data after those n bytes, copying the entire thing up to the new location.
If you actually copy the data from disk to memory, then it'd be faster to manipulate it there and write it to the new file. Otherwise, once you find the spot on the disk you want to start deleting from, you can open a new file for writing, read X bytes from the first file (where X is the file-pointer position in the first file) and write them straight into the second file, then seek to X+n in the first file and do the same from there to its EOF, appending that to what you've already put into the second file.
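A sketch of that second, streaming variant, with X and n as hypothetical offset and length:

#include <algorithm>
#include <fstream>
#include <limits>
#include <vector>

void copy_without_range(const char* src, const char* dst,
                        std::streamoff X, std::streamoff n) {
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary);
    std::vector<char> buf(1 << 16);

    auto copy_bytes = [&](std::streamoff count) {
        while (count > 0 && in) {
            std::streamsize want = static_cast<std::streamsize>(
                std::min<std::streamoff>(count, buf.size()));
            in.read(buf.data(), want);
            out.write(buf.data(), in.gcount());
            count -= in.gcount();
        }
    };

    copy_bytes(X);                        // everything before the deleted range
    in.clear();
    in.seekg(n, std::ios::cur);           // skip the n bytes being deleted
    copy_bytes(std::numeric_limits<std::streamoff>::max());  // the rest, to EOF
}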

Delete a line in an ofstream in C++

I want to erase lines within a file. I know you can store the content of the file (in a vector, for example), erase the line, and write it out again. However, that feels very cumbersome, and not very efficient if the file gets bigger.
Anyone knows of a better, more efficient, more elegant way of doing it?
On most file-systems, this is the only option you have, short of switching to an actual database.
However, if you find yourself in this situation (i.e. very large files, with inserts/deletes in the middle), consider whether you can do something like maintaining a bitmap at the top of the file, where each bit represents one line of your file. To "delete" a line, simply flip the corresponding bit value.
There's nothing particularly magical about disk files. They still like to store their data in contiguous areas (typically called something like "blocks"). They don't have ways of leaving data-free holes in the middle of these areas. So if you want to "remove" three bytes from the middle of one of these areas, something somewhere is going to have to accomplish this by moving everything else in that area back by three bytes. No, it is not efficient.
This is why text editors (which have to do this kind of thing a lot) tend to load as much of the file as possible (if not all of it) into RAM, where moving data around is much faster. They typically only write changes back to disk when requested (or periodically). If you are going to have to make lots of changes like this, I'd suggest taking a page from their book and doing something similar.
The BerkeleyDB (dbopen(3)) has an access method called DB_RECNO. This allows one to manipulate files with arbitrary lengths using any sort of record delimiter. The default uses variable-length records with unix newlines as delimiters. You then access each "record" using an integer index. Using this, you can delete arbitrary lines from your text file. This isn't specific to C++, but if you are on most Unix/Linux systems, this API is already available to you.
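A hedged sketch of deleting one line with that API; names follow the dbopen(3) man page (on some systems the 1.85-compatible API lives in <db_185.h> via libdb), so treat the details as approximate:

#include <db.h>
#include <fcntl.h>

// Delete logical record (line) number lineno from a newline-delimited file.
int delete_line(const char* path, recno_t lineno) {
    DB* db = dbopen(path, O_RDWR, 0644, DB_RECNO, NULL);  // defaults: variable-length, '\n'-delimited
    if (!db) return -1;

    DBT key;
    key.data = &lineno;                  // recno keys are record numbers, starting at 1
    key.size = sizeof lineno;

    int rc = db->del(db, &key, 0);       // remove that record
    db->sync(db, 0);                     // flush the change back to the file
    db->close(db);
    return rc;
}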

Using the pagefile for caching?

I have to deal with a huge amount of data that usually doesn't fit into main memory. The way I access this data has high locality, so caching parts of it in memory looks like a good option. Is it feasible to just malloc() a huge array, and let the operating system figure out which bits to page out and which bits to keep?
Assuming the data comes from a file, you're better off memory mapping that file. Otherwise, what you end up doing is allocating your array, and then copying the data from your file into the array -- and since your array is mapped to the page file, you're basically just copying the original file to the page file, and in the process polluting the "cache" (i.e., physical memory) so other data that's currently active has a much better chance of being evicted. Then, when you're done you (typically) write the data back from the array to the original file, which (in this case) means copying from the page file back to the original file.
Memory mapping the file instead just creates some address space and maps it directly to the original file instead. This avoids copying data from the original file to the page file (and back again when you're done) as well as temporarily moving data into physical memory on the way from the original file to the page file. The biggest win, of course, is when/if there are substantial pieces of the original file that you never really use at all (in which case they may never be read into physical memory at all, assuming the unused chunk is at least a page in size).
If the data are in a large file, look into using mmap to read it. Modern computers have so much RAM that you might not have enough swap space available.