How to look for an ANSI string in a binary file? - c++

I would like to find the first occurrence of an ANSI string in a binary file, using C++.
I know the string class has a handy find function, but I don't know how I can use it if the file is big, say 5-10 MB.
Do I need to copy the whole file into a string in memory first? If yes, how can I be sure that none of the binary characters get corrupted while copying?
Or is there a more efficient way to do it, without the need for copying it into a string?

Do I need to copy the whole file into a string in memory first?
No.
Or is there a more efficient way to do it, without the need for copying it into a string?
Of course; open the file with an std::ifstream (be sure to open in binary mode rather than text mode), create a pair of multi_pass iterators (from Boost.Spirit) around the stream, then search for the string with std::search.
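A minimal sketch of that approach, assuming Boost is available; the function name and file/pattern arguments are placeholders:

    #include <algorithm>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <boost/spirit/include/support_multi_pass.hpp>

    // Search a binary file for a pattern without loading the whole file.
    // multi_pass wraps the single-pass istreambuf_iterator so std::search,
    // which requires forward iterators, can backtrack over buffered input.
    bool contains(const std::string& path, const std::string& needle)
    {
        std::ifstream file(path, std::ios::binary);
        auto first = boost::spirit::make_default_multi_pass(
            std::istreambuf_iterator<char>(file));
        auto last = boost::spirit::make_default_multi_pass(
            std::istreambuf_iterator<char>());
        return std::search(first, last, needle.begin(), needle.end()) != last;
    }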

First of all, don't worry about corrupted characters. (But don't forget to open the file in binary mode either!) Now, suppose your search string is n characters long. Then you can search the whole file a block at a time, as long as you make sure to keep the last n-1 characters of each block to prepend to the next block. That way you won't lose matches that occur across block boundaries. So you can use that handy find function without having to read the whole file into memory at once.
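A sketch of that block-wise search; the function name is made up for illustration, and the 1 MB block size is an arbitrary choice:

    #include <cstddef>
    #include <fstream>
    #include <string>

    // Search block by block with std::string::find, carrying over the last
    // needle.size()-1 bytes so matches straddling block boundaries are found.
    long long find_in_file(const std::string& path, const std::string& needle)
    {
        if (needle.empty()) return 0;
        std::ifstream in(path, std::ios::binary);
        const std::size_t block = 1 << 20;            // 1 MB per read (tunable)
        const std::size_t keep  = needle.size() - 1;  // overlap between blocks
        std::string buf;
        long long base = 0;                           // file offset of buf[0]
        while (in) {
            std::size_t old = buf.size();
            buf.resize(old + block);
            in.read(&buf[old], block);
            buf.resize(old + static_cast<std::size_t>(in.gcount()));
            std::size_t pos = buf.find(needle);
            if (pos != std::string::npos)
                return base + static_cast<long long>(pos);
            if (buf.size() > keep) {                  // drop all but the overlap
                base += static_cast<long long>(buf.size() - keep);
                buf.erase(0, buf.size() - keep);
            }
        }
        return -1;                                    // not found
    }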

If you can mmap the file into memory, you can avoid the copy.
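A POSIX-only sketch of the mmap route, with error handling kept minimal; the kernel pages the file in on demand, so there is no explicit copy:

    #include <algorithm>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Returns the offset of the first match, or -1 if absent or on error.
    long search_mapped(const char* path, const char* needle, size_t nlen)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return -1; }
        void* m = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                       // the mapping stays valid after close
        if (m == MAP_FAILED) return -1;
        char* data = static_cast<char*>(m);
        char* end  = data + st.st_size;
        char* hit  = std::search(data, end, needle, needle + nlen);
        long result = (hit != end) ? static_cast<long>(hit - data) : -1;
        munmap(m, st.st_size);
        return result;
    }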

Related

Iterating through an array or getting characters from an open file - are there any advantages of one over the other?

I'm just wondering if say...
ifstream fin(xxx)
then
char c;
fin.get(c)
is any better than putting the entire text into an array and iterating through the array instead of getting characters from the loaded file.
I guess there's the extra step to put the input file into an array.
If the file is 237 GB, then iterating over it is more feasible than copying it to a memory array.
If you iterate, you still want to do the actual disk I/O in page sized chunks (not go to the device for every byte). But streams usually provide that kind of buffering.
So what you want is a mix of both.
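A sketch of that mix; the 64 KB buffer size is arbitrary, and note that pubsetbuf typically must be called before the file is opened to take effect:

    #include <fstream>

    int main() {
        static char buf[1 << 16];                 // 64 KB stream buffer
        std::ifstream fin;
        fin.rdbuf()->pubsetbuf(buf, sizeof buf);  // enlarge buffer before open
        fin.open("input.txt", std::ios::binary);
        char c;
        while (fin.get(c)) {
            // process c; actual disk reads happen one buffer at a time
        }
    }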

Edit the first line of a huge file in c++

Is there any "fast" way to edit the first line of a big file(~100Mg) in C++?
I know we can read the file line by line, make changes, write it to a temporary file, and rename the temporary file. But, I am wondering if there is a faster way of doing this (something like in-place modification)?
You can probably use the fwrite/fprintf family of file functions, which write at the file pointer's current position.
You open the file with fopen in update mode ("r+"; append mode would force every write to the end of the file regardless of seeking), use fseek to the beginning, and write what you need. However, you should be careful with the length of the first line: if you write less than the original line, you will still have the extra content left over; if you write more, you will overwrite your other content.
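A minimal sketch of that in-place overwrite; the file name and replacement text are placeholders:

    #include <cstdio>
    #include <cstring>

    int main() {
        FILE* f = std::fopen("data.txt", "r+b");      // update mode, not append
        if (!f) return 1;
        std::fseek(f, 0, SEEK_SET);                   // back to the first byte
        const char* line = "new first line";          // must not be longer than
        std::fwrite(line, 1, std::strlen(line), f);   // the original first line
        std::fclose(f);
    }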
100 MB is not that big on modern computers. If this is a one-time deal and you're not working on a really slow device, you can simply read the whole file, split it into lines, make your edit, and write it all back in a moment.
If this is something that's going to happen more often, you could benefit from simply adding some whitespace padding to the first line (if possible) to create a "buffer" for things that you can put there the next time. Then you can use fwrite to overwrite just that first line, without touching the rest of the file.
There may be OS and filesystem specific ways to allocate additional space inside an existing file without moving the data. For example on Linux with XFS/ext4 you can use fallocate:
int fallocate(int fd, int mode, off_t offset, off_t len);
fallocate() allows the caller to directly manipulate the allocated disk space for the file referred to by fd for the byte range starting at offset and continuing for len bytes.
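On newer kernels there is also an insert mode; a hypothetical sketch (Linux 4.1+ with ext4/XFS; offset and length must be multiples of the filesystem block size, and the inserted range reads back as zero bytes, so this only suits formats that tolerate padding):

    #define _GNU_SOURCE                 // exposes FALLOC_FL_INSERT_RANGE
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        int fd = open("big.txt", O_RDWR);
        if (fd < 0) return 1;
        // Insert a 4096-byte hole at the start of the file without moving
        // the existing data, then write the new first line into it.
        if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 0, 4096) == 0) {
            pwrite(fd, "new first line\n", 15, 0);
        }
        close(fd);
        return 0;
    }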
I believe the fastest way to accomplish your task is to create a new file that contains the first line value. Whenever you take a request to read the file, you read the first line value file first, then read the larger file, skipping over the first line that is actually stored with the larger file. Whenever you want to change the first line, just change the first line file.
You're thinking of a memory-mapped file, in which the entire file is "mapped" into memory but not actually loaded or rewritten until you attempt to access or modify a part of it. On POSIX systems, you can mmap() a part of a file (say, the first kilobyte), modify it as necessary, then use msync() to write just that chunk of memory back to the disk.
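A POSIX sketch of that, assuming a 4 KB page size (query sysconf(_SC_PAGESIZE) in real code) and an edit that does not change the line's length:

    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        int fd = open("big.txt", O_RDWR);
        if (fd < 0) return 1;
        void* m = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (m != MAP_FAILED) {
            std::memcpy(m, "EDITED", 6);   // same-length in-place change
            msync(m, 4096, MS_SYNC);       // flush just this page to disk
            munmap(m, 4096);
        }
        close(fd);
    }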

Read word from a text file

This is the requirement I must follow:
There will be a C style or C++ style string to hold the word. An int to hold a count of each word. A struct or class to hold both of these. This struct/class will be inserted into an STL list. You will also need a C style or C++ style string to hold the line of text you read from the files. You will parse this line into words as per the word definition in the assign spec.
The first part seems alright, but for the second one, I still don't get the point of reading a line and then parsing it into words. Is it more efficient than reading a word straight from the text file?
The efficiency depends on the definition of the word (which comes from the assignment spec): if you need to go through the line more than once to determine where a word begins/ends (i.e. what belongs to a word), it is more efficient to keep the line in memory than to perform the read from disk multiple times (although the performance impact can be lessened by the I/O cache).
Even if there is no performance gain, this being a homework assignment, I think you are asked to do this to learn 1) how to read strings (lines) from a file; 2) how to parse a string in memory. To achieve the two goals at once, you have this requirement.
Read the file line by line using fstream, then parse each line into words by checking for spaces until the end of the line in a loop.
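A minimal sketch of that pattern, splitting on whitespace with istringstream; the file name is a placeholder:

    #include <fstream>
    #include <sstream>
    #include <string>

    int main() {
        std::ifstream in("input.txt");
        std::string line;
        while (std::getline(in, line)) {       // one line at a time
            std::istringstream iss(line);
            std::string word;
            while (iss >> word) {
                // count or store the word here
            }
        }
    }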
Depending on your use case, it can be useful to read files line by line.
Reading the whole file into memory first and parsing it afterward does not minimize memory usage. The memory required for your program to run would be at least the size of the file. If the input file is big compared to the memory available to your program, you won't be able to allocate enough memory to store the entire file (try to allocate a 20 GB string to see what happens).
On the other hand, if you read line by line, only the size of one line is needed in memory at a time: you can release memory allocated for previous lines immediately.
So parsing line by line is useful if:
The input files are too big to fit entirely in memory
The size of each line is small enough (reading line by line does not help if the file is made of one large line)

How to delete parts from a binary file in C++

I would like to delete parts from a binary file, using C++. The binary file is about 5-10 MB.
What I would like to do:
Search for an ANSI string "something"
Once I have found this string, I would like to delete the following n bytes, for example the following 1 MB of data. I would like to delete those characters, not fill them with NULs, thus making the file smaller.
I would like to save the modified file into a new binary file that is the same as the original file, except for the missing n bytes that I have deleted.
Can you give me some advice / best practices how to do this the most efficiently? Should I load the file into memory first?
How can I search efficiently for an ANSI string? I mean, possibly I have to skip a few megabytes of data before I find that string. >> I have been told I should ask this in another question, so it's here:
How to look for an ANSI string in a binary file?
How can I delete n bytes and write it out to a new file efficiently?
OK, I don't need it to be super efficient; the file will not be bigger than 10 MB, and it's OK if it runs for a few seconds.
There are a number of fast string search routines that perform much better than testing each and every character. For example, when trying to find "something", only every 9th character needs to be tested.
Here's an example I wrote for an earlier question: code review: finding </body> tag reverse search on a non-null terminated char str
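If C++17 is available, the standard library provides such a skip-ahead searcher; a minimal sketch, assuming the file contents are already in a string:

    #include <algorithm>
    #include <functional>
    #include <string>

    bool found_in(const std::string& haystack)
    {
        const std::string needle = "something";
        // boyer_moore_searcher skips ahead on mismatches, so most positions
        // are rejected without examining every character.
        auto it = std::search(haystack.begin(), haystack.end(),
                              std::boyer_moore_searcher(needle.begin(),
                                                        needle.end()));
        return it != haystack.end();
    }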
For a 5-10 MB file I would have a look at writev() if your system supports it. Read the entire file into memory, since it is small enough. Scan for the bytes you want to drop, then pass writev() the list of iovecs (which will just be pointers into your read buffer and lengths), and you can rewrite the entire modified contents in a single system call.
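A POSIX sketch of that idea, assuming buf holds the whole file and [cut, cut + n) is the range to drop; the output file name is a placeholder:

    #include <fcntl.h>
    #include <sys/uio.h>
    #include <unistd.h>

    // Write everything except the n bytes starting at offset cut, using
    // two iovecs that point before and after the deleted range.
    void write_without_range(char* buf, size_t total, size_t cut, size_t n)
    {
        struct iovec iov[2];
        iov[0].iov_base = buf;
        iov[0].iov_len  = cut;
        iov[1].iov_base = buf + cut + n;
        iov[1].iov_len  = total - cut - n;
        int out = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out >= 0) {
            writev(out, iov, 2);           // both pieces in one system call
            close(out);
        }
    }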
First, if I understand your meaning in your "How can I search efficiently" subsection, you cannot just skip a few megabytes of data in the search if the target string might be in those first few megabytes.
As for loading the file into memory, if you do that, don't forget to make sure you have enough space in memory for the entire file. You will be frustrated if you go to use your utility and find that the 2GB file you want to use it on can't fit in the 1.5GB of memory you have left.
I am going to assume you will load into memory or memory map it for the following.
You did specifically say this is a binary file, which means you cannot use the normal C-style string searching/matching functions (strstr and friends), as the null characters in the file's data will confuse them (ending the search prematurely without a match). You might instead use memchr to find the first occurrence of the first byte in your target, and memcmp to compare the next few bytes with the bytes in the target; keep using memchr/memcmp pairs to scan through the entire thing until found. This is not the most efficient way, as there are better pattern-matching algorithms, but it is a reasonably efficient way, I suppose.
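A sketch of that memchr/memcmp scan; because both functions take explicit lengths, embedded NUL bytes in the data do not stop the search:

    #include <cstddef>
    #include <cstring>

    // Returns a pointer to the first match of pat in data, or nullptr.
    const char* find_bytes(const char* data, std::size_t size,
                           const char* pat, std::size_t plen)
    {
        if (plen == 0 || plen > size) return nullptr;
        const char* p    = data;
        const char* last = data + size - plen;    // last valid match start
        while (p <= last) {
            // jump straight to the next candidate first byte
            p = static_cast<const char*>(
                std::memchr(p, pat[0], last - p + 1));
            if (!p) return nullptr;
            if (std::memcmp(p, pat, plen) == 0) return p;
            ++p;
        }
        return nullptr;
    }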
To "delete" n bytes you have to actually move the data after those n bytes, copying the entire thing up to the new location.
If you actually copy the data from disk to memory, then it'd be faster to manipulate it there and write it to the new file. Otherwise, once you find the spot on the disk you want to start deleting from, you can open a new file for writing, read X bytes from the first file (where X is the position of the match in the first file) and write them into the second file, then seek in the first file to X+n and copy from there to file1's EOF, appending to what you've already put into file2.
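A sketch of that streaming variant; x and n are the match offset and the byte count to drop, and the 64 KB buffer is an arbitrary size:

    #include <algorithm>
    #include <fstream>
    #include <string>

    // Copy src to dst, omitting n bytes starting at offset x.
    void copy_without(const std::string& src, const std::string& dst,
                      std::streamoff x, std::streamoff n)
    {
        std::ifstream in(src, std::ios::binary);
        std::ofstream out(dst, std::ios::binary);
        char buf[65536];
        std::streamoff left = x;              // bytes before the deleted range
        while (left > 0 && in) {
            in.read(buf, std::min<std::streamoff>(left, sizeof buf));
            out.write(buf, in.gcount());
            left -= in.gcount();
        }
        in.seekg(x + n);                      // jump past the n deleted bytes
        while (in.read(buf, sizeof buf) || in.gcount())
            out.write(buf, in.gcount());
    }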

Reading text files

I'm trying to find out the best way to read large text files (at least 5 MB) in C++, considering speed and efficiency. Any preferred class or function to use, and why?
By the way, I'm running specifically in a UNIX environment.
The stream classes (ifstream) actually do a good job; assuming you're not restricted otherwise, make sure to turn off stdio synchronization (std::ios_base::sync_with_stdio(false)). You can use getline() to read directly into std::strings, though from a performance perspective using a fixed buffer as a char* (a vector of chars or an old-school char[]) may be faster (at higher risk/complexity).
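A minimal sketch of the stream approach; the file name is a placeholder:

    #include <fstream>
    #include <ios>
    #include <string>

    int main() {
        std::ios_base::sync_with_stdio(false);  // drop C-stdio synchronization
        std::ifstream in("big.txt");
        std::string line;
        while (std::getline(in, line)) {
            // process line
        }
    }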
You can go the mmap route if you're willing to play games with page size calculations and the like. I'd probably build it out first using the stream classes and see if it's good enough.
Depending on what you're doing with each line of data, you might start finding your processing routines are the optimization point and not the I/O.
Use old-style file I/O; a sketch implementing these steps follows the list.
fopen the file for binary read
fseek to the end of the file
ftell to find out how many bytes are in the file.
malloc a chunk of memory to hold all of the bytes + 1
set the extra byte at the end of the buffer to NUL.
fread the entire file into memory.
create a vector of const char *
push_back the address of the first byte into the vector.
repeatedly
strstr - search the memory block for the carriage control character(s).
put a NUL at the found position
move past the carriage control characters
push_back that address into the vector
until all of the text in the buffer has been processed.
----------------
use the vector to find the strings,
and process as needed.
when done, delete the memory block
and the vector should self-destruct.
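A sketch of those steps, using strchr to find each '\n'; the file name is a placeholder:

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <vector>

    int main() {
        FILE* f = std::fopen("big.txt", "rb");
        if (!f) return 1;
        std::fseek(f, 0, SEEK_END);
        long size = std::ftell(f);              // byte count of the file
        std::fseek(f, 0, SEEK_SET);
        char* buf = static_cast<char*>(std::malloc(size + 1));
        std::fread(buf, 1, size, f);
        std::fclose(f);
        buf[size] = '\0';                       // terminator so scans stop

        std::vector<const char*> lines;
        char* p = buf;
        while (p < buf + size) {
            lines.push_back(p);                 // start of the current line
            char* nl = std::strchr(p, '\n');    // end of the current line
            if (!nl) break;
            *nl = '\0';                         // split the buffer in place
            p = nl + 1;
        }
        // ... use the vector to find and process the strings ...
        std::free(buf);                         // the vector cleans itself up
    }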
If you are using a text file storing integers, floats and small strings, my experience is that FILE, fopen and fscanf are already fast enough, and you also get the numbers directly. I think memory mapping is the fastest, but it requires you to write code to parse the file, which is extra work.
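A minimal sketch, assuming each record is an int, a double and a word; the format string must match the actual file layout:

    #include <cstdio>

    int main() {
        FILE* f = std::fopen("data.txt", "r");
        if (!f) return 1;
        int i; double d; char word[64];
        // fscanf converts text to typed values as it reads
        while (std::fscanf(f, "%d %lf %63s", &i, &d, word) == 3) {
            // use i, d and word here
        }
        std::fclose(f);
    }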