C++ - efficiently reading a file sequentially

C++ - efficiently reading a file sequentially - c++

I need to sequentially read a file in C++, dealing with 4 characters at a time (but it's a sliding window, so the next character is handled along with the 3 before it). I could read chunks of the file into a buffer (I know mmap() will be more efficient but I want to stick to platform-independent plain C++), or I could read the file a character at a time using std::cin.read(). The file could be arbitrary large, so reading the whole file is not an option.
Which approach is more efficient?

The most efficient method is to read a lot of data into memory using the fewest function calls or requests.
The objective is to keep the hard drive spinning. One of the bottlenecks is waiting for the hard drive to spin to correct speed. Another is trying to locate the sectors on the hard drive where your requested data lives. A third bottleneck is collisions with the database and memory.
So I vote for the read method into a buffer and search the buffer.

Determine what the largest chunk of data you can read at a time. Then read the file by the chunks.
Say you can only deal with 2K characters at a time. Then, use:
std::ifstream if(filename);
char chunk[2048];
while ( if.read(chunk, 2048)) )
{
std::streamsize nread = in.gcount();
// Process nread number of characters of the chunk.
}

Related

What's an efficient way to randomize the ordering of the contents of a very large file?

For my neural-network-training project, I've got a very large file of input data. The file format is binary, and it consists of a very large number of fixed-size records. The file is currently ~13GB, but in the future it could become larger; for the purposes of this question let's assume it will be too large to just hold all of it in my computer's RAM at once.
Today's problem involves a little utility program I wrote (in C++, although I think choice of language doesn't matter too much here as one would likely encounter the same problem in any language) that is intended to read the big file and output a similar big file -- the output file is to contain the same data as the input file, except with the records shuffled into a random ordering.
To do this, I mmap() the input file into memory, then generate a list of integers from 1 to N (where N is the number of records in the input file), randomly shuffle the ordering of that list, then iterate over the list, writing out to the output file the n'th record from the mmap'd memory area.
This all works correctly, as far as it goes; the problem is that it's not scaling very well; that is, as the input file's size gets bigger, the time it takes to do this conversion is increasing faster than O(N). It's getting to the point where it's become a bottleneck for my workflow. I suspect the problem is that the I/O system (for MacOS/X 10.13.4, using the internal SSD of my Mac Pro trashcan, in case that's important) is optimized for sequential reads, and jumping around to completely random locations in the input file is pretty much the worst-case scenario as far as caching/read-ahead/other I/O optimizations are concerned. (I imagine that on a spinning disk it would perform even worse due to head-seek delays, but fortunately I'm at least using SSD here)
So my question is, is there any clever alternative strategy or optimization I could use to make this file-randomization-process more efficient -- one that would scale better as the size of my input files increases?

If problem is related to swapping and random disk access while reading random file locations, can you at least read input file sequentially?
When you're accessing some chunk in mmap-ed file, prefetcher will think that you'll need adjacent pages soon, so it will also load them. But you won't, so these pages will be discarded and loading time will be wasted.
create array of N toPositons, so toPosition[i]=i;
randomize destinations (are you using knuth's shuffle?);
then toPosition[i] = destination of input[i]. So, read input data sequentially from start and place them into corresponding place of destination file.
Perhaps, this will be more prefetcher-friendly. Of course, writing data randomly is slow too, but at least, you won't waste prefetched pages from input file.
Additional benefit is that when you've processed few millions of input data pages, these GBs will be unloaded from RAM and you'll never need them again, so you won't pollute actual disk cache. Remember that actual memory page size is at least 4K, so even when you're randomly accessing 1 byte of mmap-ed file, at least 4K of data should be read from disk into cache.

I'd recommend not using mmap() - there's no way all that memory pressure is any help at all, and unless you're re-reading the same data multiple times, mmap() is often the worst-performing way to read data.
First, generate your N random offsets, then, given those offsets, use pread() to read the data - and use low-level C-style IO.
This uses the fcntl() function to disable the page cache for your file. Since you're not re-reading the same data, the page cache likely does you little good, but it does use up RAM, slowing other things down. Try it both with and without the page cache disabled and see which is faster. Note also that I've left out all error checking:
(I'm also assuming C-style IO functions are in namespace std on a MAC, and I've used C-style strings and arrays to match the C-style IO functions while keeping the code simpler.)
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>
void sendRecords( const char *dataFile, off_t offsets, size_t numOffsets )
{
int fd = std::open( dataFile, O_RDONLY );
// try with and without this
std::fcntl( fd, F_NOCACHE, 1 );
// can also try using page-aligned memory here
char data[ RECORD_LENGTH ];
for ( size_t ii = 0; ii < numOffsets; ii++ )
{
ssize_t bytesRead = std::pread( fd, data, sizeof( data ), offsets[ ii ] );
// process this record
processRecord( data );
}
close( datafd );
}
Assuming you have a file containing precalculated random offsets:
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>
void sendRecords( const char *dataFile, const char *offsetFile )
{
int datafd = std::open( dataFile, O_RDONLY );
// try with and without this
std::fcntl( fd, F_NOCACHE, 1 );
int offsetfd = std::open( offsetFile, O_RDONLY );
// can also try using page-aligned memory here
char data[ RECORD_LENGTH ];
for ( ;; )
{
off_t offset;
ssize_t bytesRead = std::read( offsetfd, &offset, sizeof( offset ) );
if ( bytesRead != sizeof( offset ) )
{
break;
}
bytesRead = std::pread( fd, data, sizeof( data ), offset );
// process this record
processRecord( data );
}
std::close( datafd );
std::close( offsetfd );
}
You can go faster, too, since that code alternates reading and processing, and it'd probably be faster to use multiple threads to read and process simultaneously. It's not that hard to use one or more threads to read data into preallocated buffers that you then queue up and send to your processing thread.

Thanks to advice of various people in this thread (in particular Marc Glisse and Andrew Henle) I was able to reduce the execution time of my program on a 13GB input file, from ~16 minutes to ~2 minutes. I'll document how I did it in this answer, since the solution wasn't very much like either of the answers above (it was more based on Marc's comment, so I'll give Marc the checkbox if/when he restates his comment as an answer).
I tried replacing the mmap() strategy with pread(), but that didn't seem to make much difference; and I tried passing F_NOCACHE and various other flags to fcntl(), but they seemed to either have no effect or make things slower, so I decided to try a different approach.
The new approach is to do things in a 2-layer fashion: rather than reading in single records at a time, my program now loads in "blocks" of sequential records from the input file (each block containing around 4MB of data).
The blocks are loaded in random order, and I load in blocks until I have a certain amount of block-data held in RAM (currently ~4GB, as that is what my Mac's RAM can comfortably hold). Then I start grabbing random records out of random in-RAM blocks, and writing them to the output file. When a given block no longer has any records left in it to grab, I free that block and load in another block from the input file. I repeat this until all blocks from the input file have been loaded and all their records distributed to the output file.
This is faster because all of my output is strictly sequential, and my input is mostly sequential (i.e. 4MB of data is read after each seek rather than only ~2kB). The ordering of the output is slightly less random than it was, but I don't think that will be a problem for me.

How to increase speed of reading data on Windows using c++

I am reading block of data from volume snapshot using CreateFile/ReadFile and buffersize of 4096 bytes.
The problem I am facing is ReadFile is too slow, I am able to read 68439 blocks i.e. 267 Mb in 45 seconds, How can I increase the speed? Below is a part of my code that I am using,
block_handle = CreateFile(block_file,GENERIC_READ,FILE_SHARE_READ,0,OPEN_EXISTING,FILE_FLAG_SEQUENTIAL_SCAN,NULL);
if(block_handle != INVALID_HANDLE_VALUE)
{
DWORD pos = -1;
for(ULONG i = 0; i < 68439; i++)
{
sectorno = (i*8);
distance = sectorno * sectorsize;
phyoff.QuadPart = distance;
if(pos != phyoff.u.LowPart)
{
pos=SetFilePointer(block_handle, phyoff.u.LowPart,&phyoff.u.HighPart,FILE_BEGIN);
if (phyoff.u.LowPart == INVALID_SET_FILE_POINTER && GetLastError() != NO_ERROR)
{
printf("SetFilePointer Error: %d\n", GetLastError());
phyoff.QuadPart = -1;
return;
}
}
ret = ReadFile(block_handle, data, 4096, &dwRead, 0);
if(ret == FALSE)
{
printf("Error Read");
return;
}
pos += 4096;
}
}
Should I use OVERLAPPED structure? or what can be the possible solution?
Note: The code is not threaded.
Awaiting a positive response.

I'm not quite sure why you're using these extremely low level system functions for this.
Personally I have used C-style file operations (using fopen and fread) as well as C++-style operations (using fstream and read, see this link), to read raw binary files. From a local disk the read speed is on the order of 100MB/second.
In your case, if you don't want to use the standard C or C++ file operations, my guess is that the reason your code is slower is due to the fact that you're performing a seek after each block. Do you really need to call SetFilePointer for every block? If the blocks are sequential you shouldn't need to do this.
Also, experiment with different block sizes, don't be afraid to go up and beyond 1MB.

Your problem is the fragmented data reads. You cannot solve this by fiddling with ReadFile parameters. You need to defragment your reads. here are three approaches:
Defragment the data on the disk
Defragment the reads. That is, collect all the reads you need, but do not read anything yet. Sort the reads into order. Read everything in order, skipping the SetFilePointer wherever possible ( i.e. sequential blocks ). This will speed the total read greatly, but introduce a lag before the first read starts.
Memory map the data. Copy ALL the data into memory and do random access reads from memory. Whether or not this is possible depends on how much data there is in total.
Also, you might want to get fancy, and experiment with caching. When you read a block of data, it might be that although the next read is not sequential, it may well have a high probability of being close by. So when you read a block, sequentially read an enormous block of nearby data into memory. Before the next read, check if the new read is already in memory - thus saving a seek and a disk access. Testing, debugging and tuning this is a lot of work, so I do not really recommend it unless this is a mission critical optimization. Also note that your OS and/or your disk hardware may already be doing something along these lines, so be prepared to see no improvement whatsoever.

If possible, read sequentially (and tell CreateFile you intend to read sequentially with FILE_FLAG_SEQUENTIAL_SCAN).
Avoid unnecessary seeks. If you're reading sequentially, you shouldn't need any seeks.
Read larger chunks (like an integer multiple of the typical cluster size). I believe Windows's own file copy uses reads on the order of 8 MB rather than 4 KB. Consider using an integer multiple of the system's allocation granularity (available from GetSystemInfo).
Read from aligned offsets (you seem to be doing this).
Read to a page-aligned buffer. Consider using VirtualAlloc to allocate the buffer.
Be aware that fragmentation of the file can cause expensive seeking. There's not much you can do about this.
Be aware that volume compression can make seeks especially expensive because it may have to decompress the file from the beginning to find the starting point in the middle of the file.
Be aware that volume encryption might slow things down. Not much you can do but be aware.
Be aware that other software, like anti-malware, may be scanning the entire file every time you touch it. Fewer operations will minimize this hit.

Reading large txt efficiently in c++

I have to read a large text file (> 10 GB) in C++. This is a csv file with variable length lines. when I try to read line by line using ifstream it works but takes long time, i guess this is becuase each time I read a line it goes to disk and reads, which makes it very slow.
Is there a way to read in bufferes, for example read 250 MB at one shot (using read method of ifstream) and then get lines from this buffer, i see lot of issues with solution like buffer can have incomplete lines etc..
Is there a solution for this in c++ which handles all these cases etc. Are there any open source libraries that can do this for example boost etc ?
Note: I would want to avoid c stye FILE* pointers etc.

Try using the Windows memory mapped file function. The calls are buffered and you get to treat a file as if its just memory.
memory mapped files

IOstreams already use buffers much as you describe (though usually only a few kilobytes, not hundreds of megabytes). You can use pubsetbuf to get it to use a larger buffer, but I wouldn't expect any huge gains. Most of the overhead in IOstreams stems from other areas (like using virtual functions), not from lack of buffering.
If you're running this on Windows, you might be able to gain a little by writing your own stream buffer, and having it call CreateFile directly, passing (for example) FILE_FLAG_SEQUENTIAL_SCAN or FILE_FLAG_NO_BUFFERING. Under the circumstances, either of these may help your performance substantially.

If you want real speed, then you're going to have to stop reading lines into std::string, and start using char*s into the buffer. Whether you read that buffer using ifstream::read() or memory mapped files is less important, though read() has the disadvantage you note about potentially having N complete lines and an incomplete one in the buffer, and needing to recognise that (can easily do that by scanning the rest of the buffer for '\n' - perhaps by putting a NUL after the buffer and using strchr). You'll also need to copy the partial line to the start of the buffer, read the next chunk from file so it continues from that point, and change the maximum number of characters read such that it doesn't overflow the buffer. If you're nervous about FILE*, I hope you're comfortable with const char*....
As you're proposing this for performance reasons, I do hope you've profiled to make sure that it's not your CSV field extraction etc. that's the real bottleneck.

I hope this helps -
http://www.cppprog.com/boost_doc/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file
BTW, you wrote "i see lot of issues with solution like buffer can have incomplete lines etc.." - in this situation how about reading 250 MB and then read char by char until you get the delimiter to complete the line.

Reading text files

I'm trying to find out what is the best way to read large text (at least 5 mb) files in C++, considering speed and efficiency. Any preferred class or function to use and why?
By the way, I'm running on specifically on UNIX environment.

The stream classes (ifstream) actually do a good job; assuming you're not restricted otherwise make sure to turn off sync_with_stdio (in ios_base::). You can use getline() to read directly into std::strings, though from a performance perspective using a fixed buffer as a char* (vector of chars or old-school char[]) may be faster (at a higher risk/complexity).
You can go the mmap route if you're willing to play games with page size calculations and the like. I'd probably build it out first using the stream classes and see if it's good enough.
Depending on what you're doing with each line of data, you might start finding your processing routines are the optimization point and not the I/O.

Use old style file io.
fopen the file for binary read
fseek to the end of the file
ftell to find out how many bytes are in the file.
malloc a chunk of memory to hold all of the bytes + 1
set the extra byte at the end of the buffer to NUL.
fread the entire file into memory.
create a vector of const char *
push_back the address of the first byte into the vector.
repeatedly
strstr - search the memory block for the carriage control character(s).
put a NUL at the found position
move past the carriage control characters
push_back that address into the vector
until all of the text in the buffer has been processed.
----------------
use the vector to find the strings,
and process as needed.
when done, delete the memory block
and the vector should self-destruct.

If you are using text file storing integers, floats and small strings, my experience is that FILE, fopen, fscanf are already fast enough and also you can get the numbers directly. I think memory mapping is the fastest, but it requires you to write code to parse the file, which needs extra work.

When to build your own buffer system for I/O (C++)?

I have to deal with very large text files (2 GBs), it is mandatory to read/write them line by line. To write 23 millions of lines using ofstream is really slow so, at the beginning, I tried to speed up the process writing large chunks of lines in a memory buffer (for example 256 MB or 512 MB) and then write the buffer into the file. This did not work, the performance is more or less the same. I have the same problem reading the files. I know the I/O operations are buffered by the STL I/O system and this also depends on the disk scheduler policy (managed by the OS, in my case Linux).
Any idea about how to improve the performance?
PS: I have been thinking about using a background child process (or a thread) to read/write the data chunks while the program is processing data but I do not know (mainly in the case of the subprocess) if this will be worthy.

A 2GB file is pretty big, and you need to be aware of all the possible areas that can act as bottlenecks:
The HDD itself
The HDD interface (IDE/SATA/RAID/USB?)
Operating system/filesystem
C/C++ Library
Your code
I'd start by doing some measurements:
How long does your code take to read/write a 2GB file,
How fast can the 'dd' command read and write to disk? Example...
dd if=/dev/zero bs=1024 count=2000000 of=file_2GB
How long does it take to write/read using just big fwrite()/fread() calls
Assuming your disk is capable of reading/writing at about 40Mb/s (which is probably a realistic figure to start from), your 2GB file can't run faster than about 50 seconds.
How long is it actually taking?
Hi Roddy, using fstream read method
with 1.1 GB files and large
buffers(128,255 or 512 MB) it takes
about 43-48 seconds and it is the same
using fstream getline (line by line).
cp takes almost 2 minutes to copy the
file.
In which case, your're hardware-bound. cp has to read and write, and will be seeking back and forth across the disk surface like mad when it does it. So it will (as you see) be more than twice as bad as the simple 'read' case.
To improve the speed, the first thing I'd try is a faster hard drive, or an SSD.
You haven't said what the disk interface is? SATA is pretty much the easiest/fastest option. Also (obvious point, this...) make sure the disk is physically on the same machine your code is running, otherwise you're network-bound...

I would also suggest memory-mapped files but if you're going to use boost I think boost::iostreams::mapped_file is a better match than boost::interprocess.

Maybe you should look into memory mapped files.
Check them in this library : Boost.Interprocess

Just a thought, but avoid using std::endl as this will force a flush before the buffer is full. Use '\n' instead for a newline.

Don't use new to allocate the buffer like that:
Try: std::vector<>
unsigned int buffer_size = 64 * 1024 * 1024; // 64 MB for instance.
std::vector<char> data_buffer(buffer_size);
_file->read(&data_buffer[0], buffer_size);
Also read the article on using underscore in identifier names:. Note your code is OK but.

Using getline() may be inefficient because the string buffer may need to be re-sized several times as data is appended to it from the stream buffer. You can make this more efficient by pre-sizing the string:
Also you can set the size of the iostreams buffer to either very large or NULL(for unbuffered)
// Unbuffered Accesses:
fstream file;
file.rdbuf()->pubsetbuf(NULL,0);
file.open("PLOP");
// Larger Buffer
std::vector<char> buffer(64 * 1024 * 1024);
fstream file;
file.rdbuf()->pubsetbuf(&buffer[0],buffer.size());
file.open("PLOP");
std::string line;
line.reserve(64 * 1024 * 1024);
while(getline(file,line))
{
// Do Stuff.
}

If you are going to buffer the file yourself, then I'd advise some testing using unbuffered I/O (setvbuf on a file that you've fopened can turn off the library buffering).
Basically, if you are going to buffer yourself, you want to disable the library's buffering, as it's only going to cause you pain. I don't know if there is any way to do that for STL I/O, so I recommend going down to the C-level I/O.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js