Reading text files - c++

I'm trying to find out what is the best way to read large text (at least 5 mb) files in C++, considering speed and efficiency. Any preferred class or function to use and why?
By the way, I'm running on specifically on UNIX environment.

The stream classes (ifstream) actually do a good job; assuming you're not restricted otherwise make sure to turn off sync_with_stdio (in ios_base::). You can use getline() to read directly into std::strings, though from a performance perspective using a fixed buffer as a char* (vector of chars or old-school char[]) may be faster (at a higher risk/complexity).
You can go the mmap route if you're willing to play games with page size calculations and the like. I'd probably build it out first using the stream classes and see if it's good enough.
Depending on what you're doing with each line of data, you might start finding your processing routines are the optimization point and not the I/O.

Use old style file io.
fopen the file for binary read
fseek to the end of the file
ftell to find out how many bytes are in the file.
malloc a chunk of memory to hold all of the bytes + 1
set the extra byte at the end of the buffer to NUL.
fread the entire file into memory.
create a vector of const char *
push_back the address of the first byte into the vector.
repeatedly
strstr - search the memory block for the carriage control character(s).
put a NUL at the found position
move past the carriage control characters
push_back that address into the vector
until all of the text in the buffer has been processed.
----------------
use the vector to find the strings,
and process as needed.
when done, delete the memory block
and the vector should self-destruct.

If you are using text file storing integers, floats and small strings, my experience is that FILE, fopen, fscanf are already fast enough and also you can get the numbers directly. I think memory mapping is the fastest, but it requires you to write code to parse the file, which needs extra work.

Related

Iterating through an array or getting characters from an open file - are there any advantages of one over the other?

I'm just wondering if say...
ifstream fin(xxx)
then
char c;
fin.get(c)
is any better than putting the entire text into an array and iterating through the array instead of getting characters from the loaded file.
I guess there's the extra step to put the input file into an array.
If the file is 237 GB, then iterating over it is more feasible than copying it to a memory array.
If you iterate, you still want to do the actual disk I/O in page sized chunks (not go to the device for every byte). But streams usually provide that kind of buffering.
So what you want is a mix of both.

C++ - efficiently reading a file sequentially

I need to sequentially read a file in C++, dealing with 4 characters at a time (but it's a sliding window, so the next character is handled along with the 3 before it). I could read chunks of the file into a buffer (I know mmap() will be more efficient but I want to stick to platform-independent plain C++), or I could read the file a character at a time using std::cin.read(). The file could be arbitrary large, so reading the whole file is not an option.
Which approach is more efficient?
The most efficient method is to read a lot of data into memory using the fewest function calls or requests.
The objective is to keep the hard drive spinning. One of the bottlenecks is waiting for the hard drive to spin to correct speed. Another is trying to locate the sectors on the hard drive where your requested data lives. A third bottleneck is collisions with the database and memory.
So I vote for the read method into a buffer and search the buffer.
Determine what the largest chunk of data you can read at a time. Then read the file by the chunks.
Say you can only deal with 2K characters at a time. Then, use:
std::ifstream if(filename);
char chunk[2048];
while ( if.read(chunk, 2048)) )
{
std::streamsize nread = in.gcount();
// Process nread number of characters of the chunk.
}

How to read blocks of data from a file and then read from that block into a vector?

Suppose I have a file which has x records. One 'block' holds m records. Total number of blocks in file n=x/m. If I know the size of one record, say b bytes (size of one block = b*m), I can read the complete block at once using system command read() (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector.
The reason why I want to do this in the first place is to reduce the disk i/o operations. As the disk i/o operations are much more expensive according to what I have learned.
Or will it take the same amount of time as when I read record by record from file and directly put it into vectors instead of reading block by block? On reading block by block, I will have only n disk I/O's whereas x I/O's if I read record by record.
Thanks.
You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that you can treat file contents as simply mapped into your process space as if you already had a pointer into the file contents. By simply inspecting memory contents and treating it as an array, or by copying data using memcpy() you will implicitly perform read operations, but only as necessary - operating system virtual memory subsystem is smart enough to do it very efficiently.
The only possible reason to avoid mmap maybe if you are running on 32-bit OS and file size exceeds 2 gigabytes (or slightly less than that). In this case OS may have trouble allocating address space to your mmap-ed memory. But on 64-bit OS using mmap should never be a problem.
Also, mmap can be cumbersome if you are writing a lot of data, and size of the data is not known upfront. Other than that, it is always better and faster to use it over the read.
Actually, most modern operating systems rely on mmap extensively. For example, in Linux, to execute some binary, your executable is simply mmap-ed and executed from memory as if it was copied there by read, without actually reading it.
Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same by simply adjusting the size of buffer used by the stream (which is probably a lot easier).

How to delete parts from a binary file in C++

I would like to delete parts from a binary file, using C++. The binary file is about about 5-10 MB.
What I would like to do:
Search for a ANSI string "something"
Once I found this string, I would like to delete the following n bytes, for example the following 1 MB of data. I would like to delete those character, not to fill them with NULL, thus make the file smaller.
I would like to save the modified file into a new binary file, what is the same as the original file, except for the missing n bytes what I have deleted.
Can you give me some advice / best practices how to do this the most efficiently? Should I load the file into memory first?
How can I search efficiently for an ANSI string? I mean possibly I have to skip a few megabytes of data before I find that string. >> I have been told I should ask it in an other question, so its here:
How to look for an ANSI string in a binary file?
How can I delete n bytes and write it out to a new file efficiently?
OK, I don't need it to be super efficient, the file will not be bigger than 10 MB and its OK if it runs for a few seconds.
There are a number of fast string search routines that perform much better than testing each and every character. For example, when trying to find "something", only every 9th character needs to be tested.
Here's an example I wrote for an earlier question: code review: finding </body> tag reverse search on a non-null terminated char str
For a 5-10MB file I would have a look at writev() if your system supports it. Read the entire file into memory since it is small enough. Scan for the bytes you want to drop. Pass writev() the list of iovecs (which will just be pointers into your read buffer and lenghts) and then you can rewrite the entire modified contents in a single system call.
First, if I understand your meaning in your "How can I search efficiently" subsection, you cannot just skip a few megabytes of data in the search if the target string might be in those first few megabytes.
As for loading the file into memory, if you do that, don't forget to make sure you have enough space in memory for the entire file. You will be frustrated if you go to use your utility and find that the 2GB file you want to use it on can't fit in the 1.5GB of memory you have left.
I am going to assume you will load into memory or memory map it for the following.
You did specifically say this was a binary file, so this means that you cannot use the normal C++ string searching/matching, as the null characters in the file's data will confuse it (end it prematurely without a match). You might instead be able to use memchr to find the first occurrence of the first byte in your target, and memcmp to compare the next few bytes with the bytes in the target; keep using memchr/memcmp pairs to scan through the entire thing until found. This is not the most efficient way, as there are better pattern-matching algorithms, but this is a sort of efficient way, I suppose.
To "delete" n bytes you have to actually move the data after those n bytes, copying the entire thing up to the new location.
If you actually copy the data from disk to memory, then it'd be faster to manipulate it there and write to the new file. Otherwise, once you find the spot on the disk you want to start deleting from, you can open a new file for writing, read in X bytes from the first file, where X is the file pointer position into the first file, and write them right into the second file, then seek into the first file to X+n and do the same from there to file1's eof, appending that to what you've already put into file2.

Reading large txt efficiently in c++

I have to read a large text file (> 10 GB) in C++. This is a csv file with variable length lines. when I try to read line by line using ifstream it works but takes long time, i guess this is becuase each time I read a line it goes to disk and reads, which makes it very slow.
Is there a way to read in bufferes, for example read 250 MB at one shot (using read method of ifstream) and then get lines from this buffer, i see lot of issues with solution like buffer can have incomplete lines etc..
Is there a solution for this in c++ which handles all these cases etc. Are there any open source libraries that can do this for example boost etc ?
Note: I would want to avoid c stye FILE* pointers etc.
Try using the Windows memory mapped file function. The calls are buffered and you get to treat a file as if its just memory.
memory mapped files
IOstreams already use buffers much as you describe (though usually only a few kilobytes, not hundreds of megabytes). You can use pubsetbuf to get it to use a larger buffer, but I wouldn't expect any huge gains. Most of the overhead in IOstreams stems from other areas (like using virtual functions), not from lack of buffering.
If you're running this on Windows, you might be able to gain a little by writing your own stream buffer, and having it call CreateFile directly, passing (for example) FILE_FLAG_SEQUENTIAL_SCAN or FILE_FLAG_NO_BUFFERING. Under the circumstances, either of these may help your performance substantially.
If you want real speed, then you're going to have to stop reading lines into std::string, and start using char*s into the buffer. Whether you read that buffer using ifstream::read() or memory mapped files is less important, though read() has the disadvantage you note about potentially having N complete lines and an incomplete one in the buffer, and needing to recognise that (can easily do that by scanning the rest of the buffer for '\n' - perhaps by putting a NUL after the buffer and using strchr). You'll also need to copy the partial line to the start of the buffer, read the next chunk from file so it continues from that point, and change the maximum number of characters read such that it doesn't overflow the buffer. If you're nervous about FILE*, I hope you're comfortable with const char*....
As you're proposing this for performance reasons, I do hope you've profiled to make sure that it's not your CSV field extraction etc. that's the real bottleneck.
I hope this helps -
http://www.cppprog.com/boost_doc/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file
BTW, you wrote "i see lot of issues with solution like buffer can have incomplete lines etc.." - in this situation how about reading 250 MB and then read char by char until you get the delimiter to complete the line.