Edit the first line of a huge file in c++ - c++

Is there any "fast" way to edit the first line of a big file (~100 MB) in C++?
I know we can read the file line by line, make changes, write it to a temporary file, and rename the temporary file. But, I am wondering if there is a faster way of doing this (something like in-place modification)?

You can probably use the fwrite/fprintf file manipulation methods to be able to write to the file depending on the file pointer's position.
You open the file with fopen for updating (mode "r+", not an append mode, because in append mode every write goes to the end of the file regardless of fseek), fseek to the beginning, and write what you need. However, you should be careful with the length of the first line. If you write less than the original line, the extra content is left over. If you write more, you will overwrite the content that follows.
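A minimal sketch of that in-place overwrite, using std::fstream opened for both reading and writing rather than fopen. The helper name rewrite_first_line is made up for illustration, and padding the shorter replacement with spaces is an assumption about what the file format tolerates:

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: overwrites the first line of `path` in place.
// The replacement is padded with spaces so the line keeps its original
// byte length and the rest of the file is never moved or rewritten.
// Returns false if the file cannot be opened or the replacement is
// longer than the existing first line.
bool rewrite_first_line(const std::string& path, const std::string& replacement) {
    assert(replacement.find('\n') == std::string::npos);  // must stay a single line

    // in | out opens an existing file for update without truncating it.
    std::fstream file(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!file) return false;

    std::string old_line;
    if (!std::getline(file, old_line)) return false;
    if (replacement.size() > old_line.size()) return false;  // would clobber line 2

    std::string padded = replacement;
    padded.append(old_line.size() - replacement.size(), ' ');

    file.clear();                  // drop eofbit if the file had only one line
    file.seekp(0, std::ios::beg);  // back to the first byte
    file.write(padded.data(), static_cast<std::streamsize>(padded.size()));
    return static_cast<bool>(file);
}
```

Calling rewrite_first_line("data.txt", "new header") touches only the bytes of the first line, so the cost does not depend on the size of the file.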

100MB is not that big on modern computers. If this is a one time deal and you're not working on a really slow device, you can simply read the whole file, split it into lines, make your edit and write it all back in a moment.
If this is something that's going to happen more often, you could benefit from simply adding some whitespace padding to the first line (if possible) to create a "buffer" for things that you can put there the next time. Then you can use fwrite to overwrite just that first line, without touching the rest of the file.
There may be OS and filesystem specific ways to allocate additional space inside an existing file without moving the data. For example on Linux with XFS/ext4 you can use fallocate:
int fallocate(int fd, int mode, off_t offset, off_t len);
fallocate() allows the caller to directly manipulate the allocated disk space for the file referred to by fd for the byte range starting at offset and continuing for len bytes.

I believe the fastest way to accomplish your task is to create a new file that contains the first line value. Whenever you take a request to read the file, you read the first line value file first, then read the larger file, skipping over the first line that is actually stored with the larger file. Whenever you want to change the first line, just change the first line file.

You're thinking of a memory-mapped file, in which the entire file is "mapped" into memory but not actually loaded or rewritten until you attempt to access or modify a part of it. On POSIX systems, you can mmap() a part of a file (say, the first kilobyte), modify it as necessary, then use msync() to write just that chunk of memory back to the disk.
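A rough sketch of the mmap()/msync() approach on a POSIX system. The function name patch_file_start is hypothetical, and the sketch assumes the new bytes are no longer than the file itself, so the file never has to grow:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: overwrites the first text.size() bytes of the file
// through a shared mapping. Only that prefix is mapped, so the remainder
// of the file is neither read nor rewritten.
bool patch_file_start(const char* path, const std::string& text) {
    assert(!text.empty());  // mmap rejects a zero-length mapping

    const int fd = open(path, O_RDWR);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < static_cast<off_t>(text.size())) {
        close(fd);
        return false;  // the mapping must not reach past the end of the file
    }

    void* mem = mmap(nullptr, text.size(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (mem == MAP_FAILED) return false;

    std::memcpy(mem, text.data(), text.size());
    const bool ok = msync(mem, text.size(), MS_SYNC) == 0;  // flush this range only
    munmap(mem, text.size());
    return ok;
}
```

As with a plain overwrite, this cannot change the length of the first line; it only replaces bytes that are already there.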

Related

How can I delete data from a sequential file while it is being appended by another process

I'm reading data from a file in a sequential manner while the same file is being appended data by another process. So far so good. My trouble comes when I need to delete from the file the data that I have already retrieved, which I have to do in order to prevent the file from getting too large due to the writing process. I don't need to delete exactly the data that I have just retrieved, but at least do some removal periodically without losing any data that have not already been read. How can I do this with C++?
I understand that there may be different valuable approaches. I'd check as valid answer any that would prove useful to my developing the code.
This is not just a matter of C++. Whatever language you use, at some point (in its runtime, standard library implementation, interpreter, or whatever its architecture is) it will use the system calls that the system provides for file handling (e.g. open(), read(), write()).
I'm not aware of any system call that will delete parts of a file or replace parts with something else (you can position yourself somewhere in the middle of the file and start overwriting its contents, but this will be a byte for byte change, you can't change a piece of it with another piece with a different size). There are all sorts of workarounds for simulating deleting or changing parts of a file, but nothing that does it directly. For example: read from the original file, write only what you want to keep in a temporary file, remove the original and rename the temporary. But this will not work in your situation if the writing process keeps the file open.
Another approach would be something inspired by logrotate: when the file gets to a certain maximum size it gets switched with a new one, and you can process the previous one as you want. This approach does require changes in the writing process also.
You could specify the file length at the beginning, then start writing in it and when you reach your end of file, you just start writing at the beginning of the file again. But you should make sure that read pointer doesn't pass the writing pointer.
It seems like you're trying to emulate the behavior of a named pipe using a regular file. This would require special support from your operating system, which probably doesn't exist because you should be using named pipes instead. A named pipe is a special kind of file used for communication between two processes. Like a regular file, it has a path and a filename and exists on disk. However, where a regular file's contents are stored on disk, the contents of a named pipe exist only in memory, and consist only of the data that has been written but not yet read. This is exactly what you're trying to do.
Assuming you're using a Unix-based OS, you can run mkfifo outputfile and then use outputfile for reading and appending. No C++ code is required, though if you want you can also call mkfifo() from your C++ code.
If you're using Windows, it all becomes a bit more complicated. You have to specify something like \\.\pipe\outputfile as the filename for reading and appending.

std::ofstream::open will it read the entire file into memory?

I'm writing things from my memory to the disk in order to free my memory.
I wonder: each time I call open() and append new elements to the end of the file, will it read the entire file into memory, or is it just a pointer to the end of the file?
The fstream implementation doesn't specify exactly what happens if you use the ofstream::app, ios::app, ofstream::ate or ios::ate mode to open the file.
But in any sane implementation, the file is not read into memory, all that happens is that the fstream implementation positions the "current position" to the end of the file.
Reading the entire file into memory would be rather terrible if you had a system with 2GB of RAM and wanted to append to a file that is bigger than 2GB.
Being very pedantic, when writing something to a text-file, it is likely that the filesystem that is part of the operating system will read the last few (kilo)bytes of the file, as most hard-disks and similar storage requires that the data is being written to a "block", which is a fixed size (e.g. 512 bytes or 4 kilobytes). So, unless the current filesize is exactly at a boundary of such a block, the filesystem must read the last block of the file and write it back with the additional data that you asked to write.
If you are worried about appending to a log file that gets very large, no, it's not an issue. If you are worried about memory safety because your file has secret data that you don't want stored in memory, then it may be a problem, because a portion of that data will probably be loaded into memory, and there is nothing you can do to control that.

Read word from a text file

This is the requirement I must follow:
There will be a C style or C++ style string to hold the word. An int to hold a count of each word. A struct or class to hold both of these. This struct/class will be inserted into an STL list. You will also need a C style or C++ style string to hold the line of text you read from the files. You will parse this line into words as per the word definition in the assign spec.
The first part seems alright, but in the second one, I still don't get the point about reading a line then parsing it into a word. Is it more efficient than reading straight a word from text file by using?
The efficiency depends on the definition of a word (which comes from the assignment spec): if you need to go through the line more than once to determine where a word begins and ends (i.e. what belongs to a word), it is more efficient to keep the line in memory than to read it from disk multiple times (although the performance impact can be lessened by the I/O cache).
Even if there is no performance gain, this being a homework assignment, I think you are asked to do this to learn 1) how to read strings (lines) from a file; 2) how to parse a string in memory. To achieve the two goals at once, you have this requirement
Read each line from the file using fstream, then parse it into words in a loop by checking for spaces until the end of the line.
Depending on your use case, it can be useful to read files line by line.
Reading the whole file into memory first and parsing it afterward does not minimize memory usage. The memory required for your program to run would be at least the size of the file. If the input file is big compared to the memory available to your program, you won't be able to allocate enough memory to store the entire file (try to allocate a string of 20GB to see what happens).
On the other hand, if you read line by line, only the size of one line is needed in memory at a time: you can release memory allocated for previous lines immediately.
So parsing line by line is useful if:
The input files are too big to fit entirely in memory
The size of each line is small enough (reading line by line does not help if the file is made of one large line)
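To make the line-by-line approach concrete, here is a small sketch of the struct-in-a-list layout the assignment describes. The names WordCount and count_words are made up, and a "word" is assumed to be any run of non-whitespace characters, which may not match the definition in the assignment spec:

```cpp
#include <istream>
#include <list>
#include <sstream>
#include <string>

// Holds one distinct word and how many times it has been seen.
struct WordCount {
    std::string word;
    int count;
};

// Reads the input one line at a time and splits each line on whitespace.
// Only the current line is kept in memory, never the whole file.
std::list<WordCount> count_words(std::istream& in) {
    std::list<WordCount> counts;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);  // parse the line held in memory
        std::string word;
        while (words >> word) {
            bool found = false;
            for (WordCount& entry : counts) {
                if (entry.word == word) {
                    ++entry.count;
                    found = true;
                    break;
                }
            }
            if (!found) counts.push_back(WordCount{word, 1});
        }
    }
    return counts;
}
```

Passing an std::ifstream opened on the text file to count_words gives the list in order of first appearance. The linear lookup in the list is slow for large vocabularies, but it follows the STL list requirement from the assignment.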

How to delete parts from a binary file in C++

I would like to delete parts from a binary file, using C++. The binary file is about 5-10 MB.
What I would like to do:
Search for an ANSI string "something"
Once I have found this string, I would like to delete the following n bytes, for example the following 1 MB of data. I would like to delete those characters, not fill them with NULL, thus making the file smaller.
I would like to save the modified file into a new binary file, which is the same as the original file except for the n bytes that I have deleted.
Can you give me some advice / best practices how to do this the most efficiently? Should I load the file into memory first?
How can I search efficiently for an ANSI string? I mean possibly I have to skip a few megabytes of data before I find that string. >> I have been told I should ask it in another question, so it's here:
How to look for an ANSI string in a binary file?
How can I delete n bytes and write it out to a new file efficiently?
OK, I don't need it to be super efficient, the file will not be bigger than 10 MB and its OK if it runs for a few seconds.
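There are a number of fast string search routines that perform much better than testing each and every character. For example, when trying to find "something", in the best case only every 9th character needs to be tested: the pattern is 9 characters long, so a Boyer-Moore-style search can skip ahead by the full pattern length whenever the character it inspects does not occur in the pattern.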
Here's an example I wrote for an earlier question: code review: finding </body> tag reverse search on a non-null terminated char str
For a 5-10MB file I would have a look at writev() if your system supports it. Read the entire file into memory since it is small enough. Scan for the bytes you want to drop. Pass writev() the list of iovecs (which will just be pointers into your read buffer and lengths) and then you can rewrite the entire modified contents in a single system call.
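A sketch of how the writev() call might look on a POSIX system once the file is in memory and the range to drop is known. The function name write_without_range and the half-open range [cut_begin, cut_end) are illustrative choices, not part of any API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical helper: writes `data` to `fd`, leaving out the bytes in
// [cut_begin, cut_end). Both remaining pieces go out in one writev() call
// and are never copied into a second buffer.
bool write_without_range(int fd, const std::vector<char>& data,
                         std::size_t cut_begin, std::size_t cut_end) {
    assert(cut_begin <= cut_end && cut_end <= data.size());

    // iov_base is a non-const pointer, but writev only reads from it.
    char* base = const_cast<char*>(data.data());

    iovec parts[2];
    parts[0].iov_base = base;                    // everything before the cut
    parts[0].iov_len = cut_begin;
    parts[1].iov_base = base + cut_end;          // everything after the cut
    parts[1].iov_len = data.size() - cut_end;

    const ssize_t written = writev(fd, parts, 2);
    // A short write is possible; this sketch reports it as a failure
    // instead of retrying with the remaining bytes.
    return written == static_cast<ssize_t>(parts[0].iov_len + parts[1].iov_len);
}
```

Production code would loop on short writes; for a regular file of a few megabytes a single call normally writes everything.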
First, if I understand your meaning in your "How can I search efficiently" subsection, you cannot just skip a few megabytes of data in the search if the target string might be in those first few megabytes.
As for loading the file into memory, if you do that, don't forget to make sure you have enough space in memory for the entire file. You will be frustrated if you go to use your utility and find that the 2GB file you want to use it on can't fit in the 1.5GB of memory you have left.
I am going to assume you will load into memory or memory map it for the following.
You did specifically say this was a binary file, so this means that you cannot use the C-style string functions (strstr and the like), as the null characters in the file's data will confuse them (the search ends prematurely without a match). You might instead be able to use memchr to find the first occurrence of the first byte in your target, and memcmp to compare the next few bytes with the bytes in the target; keep using memchr/memcmp pairs to scan through the entire thing until found. This is not the most efficient way, as there are better pattern-matching algorithms, but this is a sort of efficient way, I suppose.
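The memchr/memcmp pairing could look roughly like this. The name find_bytes and the convention of returning the haystack length when nothing is found are arbitrary choices for the sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Returns the offset of the first occurrence of `needle` in `haystack`,
// or `haystack_len` if there is none. Zero bytes in the data are harmless
// because both memchr and memcmp take explicit lengths.
std::size_t find_bytes(const char* haystack, std::size_t haystack_len,
                       const char* needle, std::size_t needle_len) {
    assert(needle_len > 0);
    if (needle_len > haystack_len) return haystack_len;

    const char* pos = haystack;
    // Last position at which a full match could still start.
    const char* last_start = haystack + (haystack_len - needle_len);

    while (pos <= last_start) {
        // Jump to the next byte equal to the first byte of the needle.
        const void* hit = std::memchr(pos, needle[0],
                                      static_cast<std::size_t>(last_start - pos) + 1);
        if (hit == nullptr) break;
        pos = static_cast<const char*>(hit);
        // Compare the whole needle at the candidate position.
        if (std::memcmp(pos, needle, needle_len) == 0)
            return static_cast<std::size_t>(pos - haystack);
        ++pos;  // false alarm, continue one byte further
    }
    return haystack_len;
}
```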
To "delete" n bytes you have to actually move the data after those n bytes, copying the entire thing up to the new location.
If you actually copy the data from disk to memory, then it'd be faster to manipulate it there and write to the new file. Otherwise, once you find the spot on the disk you want to start deleting from, you can open a new file for writing, read in X bytes from the first file, where X is the file pointer position into the first file, and write them right into the second file, then seek into the first file to X+n and do the same from there to file1's eof, appending that to what you've already put into file2.
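The copy, seek, copy sequence described above might be sketched like this with file streams. The name copy_without_range, the 64 KB buffer size, and the parameters skip_at (the X in the text) and skip_len (the n) are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: copies `src` to `dst`, leaving out `skip_len` bytes
// starting at offset `skip_at`. Returns false if a file cannot be opened
// or `src` is shorter than `skip_at`.
bool copy_without_range(const std::string& src, const std::string& dst,
                        std::streamoff skip_at, std::streamoff skip_len) {
    assert(skip_at >= 0 && skip_len >= 0);

    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary | std::ios::trunc);
    if (!in || !out) return false;

    std::vector<char> buffer(64 * 1024);
    const std::streamoff capacity = static_cast<std::streamoff>(buffer.size());

    // 1) Copy the first skip_at bytes unchanged.
    std::streamoff remaining = skip_at;
    while (remaining > 0) {
        const std::streamsize want =
            static_cast<std::streamsize>(std::min(remaining, capacity));
        in.read(buffer.data(), want);
        const std::streamsize got = in.gcount();
        if (got <= 0) return false;  // source ended before skip_at
        out.write(buffer.data(), got);
        remaining -= got;
    }

    // 2) Jump over the bytes that should disappear.
    in.seekg(skip_at + skip_len, std::ios::beg);

    // 3) Copy whatever follows, up to end of file.
    while (in.read(buffer.data(), static_cast<std::streamsize>(buffer.size())) ||
           in.gcount() > 0) {
        out.write(buffer.data(), in.gcount());
    }
    return static_cast<bool>(out);
}
```

Here skip_at would be the offset just past the matched string, so the marker itself is kept and only the n bytes after it are dropped.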

How to look for an ANSI string in a binary file?

I would like to find the first occurrence of an ANSI string in a binary file, using C++.
I know the string class has a handy find function, but I don't know how can I use it if the file is big, say 5-10 MB.
Do I need to copy the whole file into a string in memory first? If yes, how can I be sure that none of the binary characters get corrupted while copying?
Or is there a more efficient way to do it, without the need for copying it into a string?
Do I need to copy the whole file into a string in memory first?
No.
Or is there a more efficient way to do it, without the need for copying it into a string?
Of course; open the file with an std::ifstream (be sure to open in binary mode rather than text mode), create a pair of multi_pass iterators (from Boost.Spirit) around the stream, then search for the string with std::search.
First of all, don't worry about corrupted characters. (But don't forget to open the file in binary mode either!) Now, suppose your search string is n characters long. Then you can search the whole file a block at a time, as long as you make sure to keep the last n-1 characters of each block to prepend to the next block. That way you won't lose matches that occur across block boundaries. So you can use that handy find function without having to read the whole file into memory at once.
If you can mmap the file into memory, you can avoid the copy.