Reading large txt efficiently in c++

Reading large txt efficiently in c++ - c++

I have to read a large text file (> 10 GB) in C++. This is a csv file with variable length lines. when I try to read line by line using ifstream it works but takes long time, i guess this is becuase each time I read a line it goes to disk and reads, which makes it very slow.
Is there a way to read in bufferes, for example read 250 MB at one shot (using read method of ifstream) and then get lines from this buffer, i see lot of issues with solution like buffer can have incomplete lines etc..
Is there a solution for this in c++ which handles all these cases etc. Are there any open source libraries that can do this for example boost etc ?
Note: I would want to avoid c stye FILE* pointers etc.

Try using the Windows memory mapped file function. The calls are buffered and you get to treat a file as if its just memory.
memory mapped files

IOstreams already use buffers much as you describe (though usually only a few kilobytes, not hundreds of megabytes). You can use pubsetbuf to get it to use a larger buffer, but I wouldn't expect any huge gains. Most of the overhead in IOstreams stems from other areas (like using virtual functions), not from lack of buffering.
If you're running this on Windows, you might be able to gain a little by writing your own stream buffer, and having it call CreateFile directly, passing (for example) FILE_FLAG_SEQUENTIAL_SCAN or FILE_FLAG_NO_BUFFERING. Under the circumstances, either of these may help your performance substantially.

If you want real speed, then you're going to have to stop reading lines into std::string, and start using char*s into the buffer. Whether you read that buffer using ifstream::read() or memory mapped files is less important, though read() has the disadvantage you note about potentially having N complete lines and an incomplete one in the buffer, and needing to recognise that (can easily do that by scanning the rest of the buffer for '\n' - perhaps by putting a NUL after the buffer and using strchr). You'll also need to copy the partial line to the start of the buffer, read the next chunk from file so it continues from that point, and change the maximum number of characters read such that it doesn't overflow the buffer. If you're nervous about FILE*, I hope you're comfortable with const char*....
As you're proposing this for performance reasons, I do hope you've profiled to make sure that it's not your CSV field extraction etc. that's the real bottleneck.

I hope this helps -
http://www.cppprog.com/boost_doc/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file
BTW, you wrote "i see lot of issues with solution like buffer can have incomplete lines etc.." - in this situation how about reading 250 MB and then read char by char until you get the delimiter to complete the line.

Related

Why is reading big text file in parallel bad?

I have a big txt file with ~30 millions rows, each row is seperated by a line seperator \n. And I'd like to read all lines to an unordered list (e.g. std::list<std::string>).
std::list<std::string> list;
std::ifstream file(path);
while(file.good())
{
std::string tmp;
std::getline(file, tmp);
list.emplace_back(tmp);
}
process_data(list);
The current implementation is very slow, so I'm learning how to read data by chunk.
But after seeing this comment:
parallelizing on a HDD will make things worse, with the impact depending on the distribution of the files on the HDD. On a SSD it might (!) improve things.
Is it bad to read a file in parallel? What's the algorithm to read all lines of a file to an unordered container (e.g. std::list, normal array,...) as fast as possible, without using any libraries, and the code must be cross-platform?

Is it bad to read a file in parallel? What's the algorithm to read all
lines of a file to an unordered container (e.g. std::list, normal
array,...) as fast as possible, without using any libraries, and the
code must be cross-platform?
I guess I'll attempt to answer this one to avoid spamming the comments. I have, in multiple scenarios, sped up text file parsing substantially using multithreading. However, the keyword here is parsing, not disk I/O (though just about any text file read involves some level of parsing). Now first things first:
VTune here was telling me that my top hotspots were in parsing (sorry, this image was taken years ago and I didn't expand the call graph to show what inside obj_load was taking most of the time, but it was sscanf). This profiling session actually surprised me quite a bit. In spite of having been profiling for decades to the point where my hunches aren't too inaccurate (not accurate enough to avoid profiling, mind you, not even close, but I've tuned my sort of intuitive spider senses enough to where profiling sessions usually don't surprise me that much even without any glaring algorithmic inefficiencies -- though I might still be off about exactly why they exist since I'm not so good at assembly).
Yet this time I was really taken back and shocked so this example has always been one I used to show even the most skeptical colleagues who don't want to use profilers to show why profiling is so important. Some of them are actually good at guessing where hotspots exists and some were actually creating very competent-performing solutions in spite of never having used them, but none of them were good at guessing what isn't a hotspot, and none of them could draw a call graph based on their hunches. So I always liked to use this example to try to convert the skeptics and get them to spend a day just trying out VTune (and we had a boatload of free licenses from Intel who worked with us which were largely going to waste on our team which I thought was a tragedy since VTune is a really expensive piece of software).
And the reason I was taken back this time was not because I was surprised by the sscanf hotspot. That's kind of a no-brainer that non-trivial parsing of epic text files is going to generally be bottlenecked by string parsing. I could have guessed that. My colleagues who never touched a profiler could have guessed that. What I couldn't have guessed was how much of a bottleneck it was. I thought given the fact that I was loading millions of polygons and vertices, texture coordinates, normals, creating edges and finding adjacency data, using index FOR compression, associating materials from the MTL file to the polygons, reverse engineering object normals stored in the OBJ file and consolidating them to form edge creasing, etc. I would at least have a good chunk of the time distributed in the mesh system as well (I would have guessed 25-33% of the time spent in the mesh engine).
Turned out the mesh system took barely any time to my most pleasant surprise, and there my hunches were completely off about it specifically. It was, by far, parsing that was the uber bottleneck (not disk I/O, not the mesh engine).
So that's when I applied this optimization to multithread the parsing, and there it helped a lot. I even initially started off with a very modest multithreaded implementation which barely did any parsing except scanning the character buffers for line endings in each thread just to end up parsing in the loading thread, and that already helped by a decent amount (reduced the operation from 16 seconds to about 14 IIRC, and I eventually got it down to ~8 seconds and that was on an i3 with just two cores and hyperthreading). So anyway, yeah, you can probably make things faster with multithreaded parsing of character buffers you read in from text files in a single thread. I wouldn't use threads as a way to make disk I/O any faster.
I'm reading the characters from the file in binary into big char buffers in a single thread, then, using a parallel loop, have the threads figure out integer ranges for the lines in that buffer.
// Stores all the characters read in from the file in big chunks.
// This is shared for read-only access across threads.
vector<char> buffer;
// Local to a thread:
// Stores the starting position of each line.
vector<size_t> line_start;
// Stores the assigned buffer range for the thread:
size_t buffer_start, buffer_end;
Basically like so:
LINE1 and LINE2 are considered to belong to THREAD 1, while LINE3 is considered to belong to THREAD 2. LINE6 is not considered to belong to any thread since it doesn't have an EOL. Instead the characters of LINE6 will be combined with the next chunky buffer read from the file.
Each thread begins by looking at the first character in its assigned character buffer range. Then it works backwards until it finds an EOL or reaches the beginning of the buffer. After that it works forward and parses each line, looking for EOLs and doing whatever else we want, until it reaches the end of its assigned character buffer range. The last "incomplete line" is not processed by the thread, but instead the next thread (or if the thread is the last thread, then it is processed on the next big chunky buffer read by the first thread). The diagram is teeny (couldn't fit much) but I read in the character buffers from the file in the loading thread in big chunks (megabytes) before the threads parse them in parallel loops, and each thread might then parse thousands of lines from its designated buffer range.
std::list<std::string> list;
std::ifstream file(path);
while(file.good())
{
std::string tmp;
std::getline(file, tmp);
list.emplace_back(tmp);
}
process_data(list);
Kind of echoing Veedrac's comments, storing your lines in std::list<std::string> if you want to really load an epic number of lines quickly is not a good idea. That would actually be a bigger priority to address than multithreading. I'd turn that into just std::vector<char> all_lines storing all the strings, and you can use std::vector<size_t> line_start to store the starting line position of an nth line, which you can retrieve like so:
// note that 'line' will be EOL-terminated rather than null-terminated
// if it points to the original buffer.
const char* line = all_lines.data() + line_start[n];
The immediate problem with std::list without a custom allocator is a heap allocation per node. On top of that we're wasting memory storing two extra pointers per line. std::string is problematic here because SBO optimizations to avoid heap allocation would make it take too much memory for small strings (and thereby increase cache misses) or still end up invoking heap allocations for every non-small string. So you end up avoiding all these problems just storing everything in one giant char buffer, like in std::vector<char>. I/O streams, including stringstreams and functions like getline, are also horrible for performance, just awful, in ways that really disappointed me at first since my first OBJ loader used those and it was over 20 times slower than the second version where I ported all those I/O stream operators and functions and use of std::string to make use of C functions and my own hand-rolled stuff operating on char buffers. When it comes to parsing in performance-critical contexts, C functions like sscanf and memchr and plain old character buffers tend to be so much faster than the C++ ways of doing it, but you can at least still use std::vector<char> to store huge buffers, e.g., to avoid dealing with malloc/free and get some debug-build sanity checks when accessing the character buffer stored inside.

Improving/optimizing file write speed in C++

I've been running into some issues with writing to a file - namely, not being able to write fast enough.
To explain, my goal is to capture a stream of data coming in over gigabit Ethernet and simply save it to a file.
The raw data is coming in at a rate of 10MS/s, and it's then saved to a buffer and subsequently written to a file.
Below is the relevant section of code:
std::string path = "Stream/raw.dat";
ofstream outFile(path, ios::out | ios::app| ios::binary);
if(outFile.is_open())
cout << "Yes" << endl;
while(1)
{
rxSamples = rxStream->recv(&rxBuffer[0], rxBuffer.size(), metaData);
switch(metaData.error_code)
{
//Irrelevant error checking...
//Write data to a file
std::copy(begin(rxBuffer), end(rxBuffer), std::ostream_iterator<complex<float>>(outFile));
}
}
The issue I'm encountering is that it's taking too long to write the samples to a file. After a second or so, the device sending the samples reports its buffer has overflowed. After some quick profiling of the code, nearly all of the execution time is spent on std::copy(...) (99.96% of the time to be exact). If I remove this line, I can run the program for hours without encountering any overflow.
That said, I'm rather stumped as to how I can improve the write speed. I've looked through several posts on this site, and it seems like the most common suggestion (in regard to speed) is to implement file writes as I've already done - through the use of std::copy.
If it's helpful, I'm running this program on Ubuntu x86_64. Any suggestions would be appreciated.

So the main problem here is that you try to write in the same thread as you receive, which means that your recv() can only be called again after copy is complete. A few observations:
Move the writing to a different thread. This is about a USRP, so GNU Radio might really be the tool of your choice -- it's inherently multithreaded.
Your output iterator is probably not the most performant solution. Simply "write()" to a file descriptor might be better, but that's performance measurements that are up to you
If your hard drive/file system/OS/CPU aren't up to the rates coming in from the USRP, even if decoupling receiving from writing thread-wise, then there's nothing you can do -- get a faster system.
Try writing to a RAM disk instead
In fact, I don't know how you came up with the std::copy approach. The rx_samples_to_file example that comes with UHD does this with a simple write, and you should definitely favor that over copying; file I/O can, on good OSes, often be done with one copy less, and iterating over all elements is probably very slow.

Let's do a bit of math.
Your samples are (apparently) of type std::complex<std::float>. Given a (typical) 32-bit float, that means each sample is 64 bits. At 10 MS/s, that means the raw data is around 80 megabytes per second--that's within what you can expect to write to a desktop (7200 RPM) hard drive, but getting fairly close to the limit (which is typically around 100-100 megabytes per second or so).
Unfortunately, despite the std::ios::binary, you're actually writing the data in text format (because std::ostream_iterator basically does stream << data;).
This not only loses some precision, but increases the size of the data, at least as a rule. The exact amount of increase depends on the data--a small integer value can actually decrease the quantity of data, but for arbitrary input, a size increase close to 2:1 is fairly common. With a 2:1 increase, your outgoing data is now around 160 megabytes/second--which is faster than most hard drives can handle.
The obvious starting point for an improvement would be to write the data in binary format instead:
uint32_t nItems = std::end(rxBuffer)-std::begin(rxBuffer);
outFile.write((char *)&nItems, sizeof(nItems));
outFile.write((char *)&rxBuffer[0], sizeof(rxBuffer));
For the moment I've used sizeof(rxBuffer) on the assumption that it's a real array. If it's actually a pointer or vector, you'll have to compute the correct size (what you want is the total number of bytes to be written).
I'd also note that as it stands right now, your code has an even more serious problem: since it hasn't specified a separator between elements when it writes the data, the data will be written without anything to separate one item from the next. That means if you wrote two values of (for example) 1 and 0.2, what you'd read back in would not be 1 and 0.2, but a single value of 10.2. Adding separators to your text output will add yet more overhead (figure around 15% more data) to a process that's already failing because it generates too much data.
Writing in binary format means each float will consume precisely 4 bytes, so delimiters are not necessary to read the data back in correctly.
The next step after that would be to descend to a lower-level file I/O routine. Depending on the situation, this might or might not make much difference. On Windows, you can specify FILE_FLAG_NO_BUFFERING when you open a file with CreateFile. This means that reads and writes to that file will basically bypass the cache and go directly to the disk.
In your case, that's probably a win--at 10 MS/s, you're probably going to use up the cache space quite a while before you reread the same data. In such a case, letting the data go into the cache gains you virtually nothing, but costs you some data to copy data to the cache, then somewhat later copy it out to the disk. Worse, it's likely to pollute the cache with all this data, so it's no longer storing other data that's a lot more likely to benefit from caching.

Which is faster in memory, ints or chars? And file-mapping or chunk reading?

Okay, so I've written a (rather unoptimized) program before to encode images to JPEGs, however, now I am working with MPEG-2 transport streams and the H.264 encoded video within them. Before I dive into programming all of this, I am curious what the fastest way to deal with the actual file is.
Currently I am file-mapping the .mts file into memory to work on it, although I am not sure if it would be faster to (for example) read 100 MB of the file into memory in chunks and deal with it that way.
These files require a lot of bit-shifting and such to read flags, so I am wondering that when I reference some of the memory if it is faster to read 4 bytes at once as an integer or 1 byte as a character. I thought I read somewhere that x86 processors are optimized to a 4-byte granularity, but I'm not sure if this is true...
Thanks!

Memory mapped files are usually the fastest operations available if you require your file to be available synchronously. (There are some asynchronous APIs that allow the O/S to reorder things for a slight speed increase sometimes, but that sounds like it's not helpful in your application)
The main advantage you're getting with the mapped files is that you can work in memory on the file while it is still being read from disk by the O/S, and you don't have to manage your own locking/threaded file reading code.
Memory reference wise, on the x86 memory is going to be read an entire line at a time no matter what you're actually working with. The extra time associated with non byte granular operations refers to the fact that integers need not be byte aligned. For example, performing an ADD will take more time if things aren't aligned on a 4 byte boundary, but for something like a memory copy there will be little difference. If you are working with inherently character data then it's going to be faster to keep it that way than to read everything as integers and bit shift things around.
If you're doing h.264 or MPEG2 encoding the bottleneck is probably going to be CPU time rather than disk i/o in any case.

If you have to access the whole file, it is always faster to read it to memory and do the processing there. Of course, it's also wasting memory, and you have to lock the file somehow so you won't get concurrent access by some other application, but optimization is about compromises anyway. Memory mapping is faster if you're skipping (large) parts of the file, because you don't have to read them at all then.
Yes, accessing memory at 4-byte (or even 8-byte) granularity is faster than accessing it byte-wise. Again it's a compromise - depending on what you have to do with the data afterwards, and how skilled you are at fiddling with the bits in an int, it might not be faster overall.
As for everything regarding optimization:
measure
optimize
measure

These are sequential bit-streams - you basically consume them one bit at a time without random-access.
You don't need to put a lot of effort into explicitly buffering reads and such in this scenario: the operating system will be buffering them for you anyway. I've written H.264 parsers before, and the time is completely dominated by the decoding and manipulation, not the IO.
My recommendation is to use a standard library and for parsing these bit-streams.
Flavor is such a parser, and the website even includes examples of MPEG-2 (PS) and various H.264 parts like M-Coder. Flavor builds native parsing code from a c++-like language; here's an quote from the MPEG-2 PS spec:
class TargetBackgroundGridDescriptor extends BaseProgramDescriptor : unsigned int(8) tag = 7
{
unsigned int(14) horizontal_size;
unsigned int(14) vertical_size;
unsigned int(4) aspect_ratio_information;
}
class VideoWindowDescriptor extends BaseProgramDescriptor : unsigned int(8) tag = 8
{
unsigned int(14) horizontal_offset;
unsigned int(14) vertical_offset;
unsigned int(4) window_priority;
}

Regarding to the best size to read from memory, I'm sure you will enjoy reading this post about memory access performance and cache effects.

One thing to consider about memory-mapping files is that a file with a size greater than the available address range will only be able to be map a portion of the file. To access the remainder of the file requires the first part to be unmapped and the next part to mapped in its place.
Since you're decoding mpeg streams you may want to use a double buffered approach with asynchronous file reading. It works like this:
blocksize = 65536 bytes (or whatever)
currentblock = new byte [blocksize]
nextblock = new byte [blocksize]
read currentblock
while processing
asynchronously read nextblock
parse currentblock
wait for asynchronous read to complete
swap nextblock and currentblock
endwhile

Reading text files

I'm trying to find out what is the best way to read large text (at least 5 mb) files in C++, considering speed and efficiency. Any preferred class or function to use and why?
By the way, I'm running on specifically on UNIX environment.

The stream classes (ifstream) actually do a good job; assuming you're not restricted otherwise make sure to turn off sync_with_stdio (in ios_base::). You can use getline() to read directly into std::strings, though from a performance perspective using a fixed buffer as a char* (vector of chars or old-school char[]) may be faster (at a higher risk/complexity).
You can go the mmap route if you're willing to play games with page size calculations and the like. I'd probably build it out first using the stream classes and see if it's good enough.
Depending on what you're doing with each line of data, you might start finding your processing routines are the optimization point and not the I/O.

Use old style file io.
fopen the file for binary read
fseek to the end of the file
ftell to find out how many bytes are in the file.
malloc a chunk of memory to hold all of the bytes + 1
set the extra byte at the end of the buffer to NUL.
fread the entire file into memory.
create a vector of const char *
push_back the address of the first byte into the vector.
repeatedly
strstr - search the memory block for the carriage control character(s).
put a NUL at the found position
move past the carriage control characters
push_back that address into the vector
until all of the text in the buffer has been processed.
----------------
use the vector to find the strings,
and process as needed.
when done, delete the memory block
and the vector should self-destruct.

If you are using text file storing integers, floats and small strings, my experience is that FILE, fopen, fscanf are already fast enough and also you can get the numbers directly. I think memory mapping is the fastest, but it requires you to write code to parse the file, which needs extra work.

When to build your own buffer system for I/O (C++)?

I have to deal with very large text files (2 GBs), it is mandatory to read/write them line by line. To write 23 millions of lines using ofstream is really slow so, at the beginning, I tried to speed up the process writing large chunks of lines in a memory buffer (for example 256 MB or 512 MB) and then write the buffer into the file. This did not work, the performance is more or less the same. I have the same problem reading the files. I know the I/O operations are buffered by the STL I/O system and this also depends on the disk scheduler policy (managed by the OS, in my case Linux).
Any idea about how to improve the performance?
PS: I have been thinking about using a background child process (or a thread) to read/write the data chunks while the program is processing data but I do not know (mainly in the case of the subprocess) if this will be worthy.

A 2GB file is pretty big, and you need to be aware of all the possible areas that can act as bottlenecks:
The HDD itself
The HDD interface (IDE/SATA/RAID/USB?)
Operating system/filesystem
C/C++ Library
Your code
I'd start by doing some measurements:
How long does your code take to read/write a 2GB file,
How fast can the 'dd' command read and write to disk? Example...
dd if=/dev/zero bs=1024 count=2000000 of=file_2GB
How long does it take to write/read using just big fwrite()/fread() calls
Assuming your disk is capable of reading/writing at about 40Mb/s (which is probably a realistic figure to start from), your 2GB file can't run faster than about 50 seconds.
How long is it actually taking?
Hi Roddy, using fstream read method
with 1.1 GB files and large
buffers(128,255 or 512 MB) it takes
about 43-48 seconds and it is the same
using fstream getline (line by line).
cp takes almost 2 minutes to copy the
file.
In which case, your're hardware-bound. cp has to read and write, and will be seeking back and forth across the disk surface like mad when it does it. So it will (as you see) be more than twice as bad as the simple 'read' case.
To improve the speed, the first thing I'd try is a faster hard drive, or an SSD.
You haven't said what the disk interface is? SATA is pretty much the easiest/fastest option. Also (obvious point, this...) make sure the disk is physically on the same machine your code is running, otherwise you're network-bound...

I would also suggest memory-mapped files but if you're going to use boost I think boost::iostreams::mapped_file is a better match than boost::interprocess.

Maybe you should look into memory mapped files.
Check them in this library : Boost.Interprocess

Just a thought, but avoid using std::endl as this will force a flush before the buffer is full. Use '\n' instead for a newline.

Don't use new to allocate the buffer like that:
Try: std::vector<>
unsigned int buffer_size = 64 * 1024 * 1024; // 64 MB for instance.
std::vector<char> data_buffer(buffer_size);
_file->read(&data_buffer[0], buffer_size);
Also read the article on using underscore in identifier names:. Note your code is OK but.

Using getline() may be inefficient because the string buffer may need to be re-sized several times as data is appended to it from the stream buffer. You can make this more efficient by pre-sizing the string:
Also you can set the size of the iostreams buffer to either very large or NULL(for unbuffered)
// Unbuffered Accesses:
fstream file;
file.rdbuf()->pubsetbuf(NULL,0);
file.open("PLOP");
// Larger Buffer
std::vector<char> buffer(64 * 1024 * 1024);
fstream file;
file.rdbuf()->pubsetbuf(&buffer[0],buffer.size());
file.open("PLOP");
std::string line;
line.reserve(64 * 1024 * 1024);
while(getline(file,line))
{
// Do Stuff.
}

If you are going to buffer the file yourself, then I'd advise some testing using unbuffered I/O (setvbuf on a file that you've fopened can turn off the library buffering).
Basically, if you are going to buffer yourself, you want to disable the library's buffering, as it's only going to cause you pain. I don't know if there is any way to do that for STL I/O, so I recommend going down to the C-level I/O.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js