How to read a huge file in C++

If I have a huge file (e.g. 1 TB, or any size that does not fit into RAM; the file is stored on disk) and it is delimited by spaces, and my RAM is only 8 GB, can I read that file with ifstream? If not, how do I read a block of the file (e.g. 4 GB) at a time?

There are a couple of things that you can do.
First, there's no problem opening a file that is larger than the amount of RAM that you have. What you won't be able to do is copy the whole file into memory at once. The best approach is to read just a chunk at a time and process it. You can use ifstream for that purpose (with ifstream::read, for instance). Allocate, say, one megabyte of memory, read the first megabyte of the file into it, rinse and repeat:
#include <fstream>
#include <memory>

std::ifstream bigFile("mybigfile.dat", std::ios::binary);
constexpr std::streamsize bufferSize = 1024 * 1024;   // 1 MB at a time
std::unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
    bigFile.read(buffer.get(), bufferSize);
    std::streamsize bytesRead = bigFile.gcount();     // the last chunk may be short
    // process bytesRead bytes of data in buffer
}
Another solution is to map the file to memory. Most operating systems will allow you to map a file to memory even if it is larger than the physical amount of memory that you have. This works because the operating system knows that each memory page associated with the file can be mapped and unmapped on-demand: when your program needs a specific page, the OS will read it from the file into your process's memory and swap out a page that hasn't been used in a while.
However, this can only work if the file is smaller than the maximum amount of memory that your process can theoretically use. This isn't an issue with a 1TB file in a 64-bit process, but it wouldn't work in a 32-bit process.
Also be aware of the spirits that you're summoning. Memory-mapping a file is not the same thing as reading from it. If the file is suddenly truncated by another program, your program is likely to crash. If you modify the data, you may run out of memory if the changes cannot be written back to disk. Also, your operating system's algorithm for paging memory in and out may not work to your advantage. Because of these uncertainties, I would consider mapping the file only if reading it in chunks using the first solution cannot work.
On Linux/OS X, you would use mmap for it. On Windows, you would open a file and then use CreateFileMapping then MapViewOfFile.
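For example, a minimal sketch of the mmap route on Linux/OS X, assuming a file named mybigfile.dat and leaving out error checks:
#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

int fd = open("mybigfile.dat", O_RDONLY);
struct stat st;
fstat(fd, &st);                              // st.st_size is the file length
void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (p != MAP_FAILED)
{
    const char* data = static_cast<const char*>(p);
    // data[0] .. data[st.st_size - 1] can now be read like an ordinary array;
    // pages are faulted in from the file only when they are touched
    munmap(p, st.st_size);
}
close(fd);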

I am sure you don't have to keep the whole file in memory. Typically one wants to read and process a file in chunks. If you want to use ifstream, you can do something like this:
#include <fstream>

std::ifstream is("/path/to/file", std::ios::binary);
char buf[4096];
do {
    is.read(buf, sizeof(buf));
    process_chunk(buf, is.gcount());   // gcount() = bytes actually read
} while (is);

A more advanced approach is, instead of reading the whole file or its chunks into memory, to map it into memory using platform-specific APIs:
Under Windows: CreateFileMapping(), MapViewOfFile()
Under Linux: open(2) / creat(2), shm_open, mmap
You will need to build a 64-bit application to make this work with a file of that size.
For more details see: CreateFileMapping, MapViewOfFile, how to avoid holding up the system memory
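For the Windows route, a minimal sketch along the same lines (placeholder file name, error handling omitted):
#include <windows.h>

HANDLE hFile = CreateFileA("mybigfile.dat", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
HANDLE hMapping = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL);  // 0,0 = size of the file
const char* data = static_cast<const char*>(MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0));
if (data)
{
    // read through data[] here; pages are brought in on demand
    UnmapViewOfFile(data);
}
CloseHandle(hMapping);
CloseHandle(hFile);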

You can use fread:
std::vector<char> buffer(size);                             // size = a chunk size that fits in RAM
size_t bytesRead = fread(buffer.data(), 1, size, fp);       // returns the number of bytes actually read
Or, if you want to use C++ fstreams, you can use read as buratino said.
Also bear in mind that you can open a file regardless of its size; the idea is to open it and read it in chunks that fit in your RAM.

Related

std::ofstream::open will it read the entire file into memory?

I'm writing things from my memory to the disk in order to free my memory.
I wonder: each time I call open() and append new elements to the end of the file, will it read the entire file into memory, or does it just keep a pointer to the end of the file?
The fstream implementation doesn't specify exactly what happens if you use the ofstream::app, ios::app, ofstream::ate or ios::ate mode to open the file.
But in any sane implementation, the file is not read into memory, all that happens is that the fstream implementation positions the "current position" to the end of the file.
Reading the entire file into memory would be rather terrible if you have a system with 2 GB of RAM and you wanted to append to a file that is bigger than 2 GB.
Being very pedantic: when writing something to a file, it is likely that the filesystem that is part of the operating system will read the last few (kilo)bytes of the file, because most hard disks and similar storage require data to be written in fixed-size "blocks" (e.g. 512 bytes or 4 kilobytes). So, unless the current file size falls exactly on a block boundary, the filesystem must read the last block of the file and write it back together with the additional data that you asked to write.
If you are worried about appending to a log file that gets very large: no, it's not an issue. If you are worried about memory safety because your file has secret data that you don't want held in memory, then it may be a problem, because a portion of the file will probably be loaded into memory by the filesystem, and there is nothing you can do to control that.
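For illustration, a minimal sketch of appending with std::ios::app (the file name is a placeholder); the open itself does not read the existing contents, it just arranges for writes to go to the end:
#include <fstream>

std::ofstream log("results.log", std::ios::app);   // existing contents are not read into memory
log << "new record\n";                             // each write is appended at the end of the file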

How to read blocks of data from a file and then read from that block into a vector?

Suppose I have a file which has x records. One 'block' holds m records, so the total number of blocks in the file is n = x/m. If I know the size of one record, say b bytes (so the size of one block is b*m), I can read a complete block at once using the system call read() (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector?
The reason why I want to do this in the first place is to reduce disk I/O operations, since disk I/O is much more expensive, from what I have learned.
Or will it take the same amount of time as reading record by record from the file and putting each directly into the vector, instead of reading block by block? Reading block by block, I would need only n disk I/Os, whereas reading record by record would need x.
Thanks.
You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that the file contents are simply mapped into your process's address space, as if you already had a pointer to them. By inspecting that memory directly, or by copying data out of it with memcpy(), you implicitly perform read operations, but only as necessary; the operating system's virtual memory subsystem is smart enough to do this very efficiently.
The only real reason to avoid mmap is if you are running on a 32-bit OS and the file size exceeds 2 gigabytes (or slightly less than that). In that case the OS may have trouble allocating address space for your mmap-ed memory. On a 64-bit OS, using mmap should never be a problem.
Also, mmap can be cumbersome if you are writing a lot of data and the size of the data is not known up front. Other than that, it is generally better and faster than using read().
In fact, most modern operating systems rely on mmap extensively. For example, on Linux, to execute a binary, the executable is simply mmap-ed and run from memory as if it had been copied there by read(), without actually being read up front.
Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same by simply adjusting the size of buffer used by the stream (which is probably a lot easier).
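As a sketch of the block-then-vector idea from the question (the record layout, file name and block size below are made up for illustration):
#include <fstream>
#include <vector>

struct Record { char payload[64]; };            // b = 64 bytes per record, for example

std::ifstream in("records.dat", std::ios::binary);
const std::size_t m = 1024;                     // records per block
std::vector<Record> block(m);
in.read(reinterpret_cast<char*>(block.data()), m * sizeof(Record));  // one read call per block
std::size_t recordsRead = in.gcount() / sizeof(Record);
block.resize(recordsRead);                      // keep only the records actually read
// each block[i] is now a separate element of the vector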

How do you pre-allocate space for a file in C/C++ on Windows?

I'm adding some functionality to an existing code base that uses pure C functions (fopen, fwrite, fclose) to write data out to a file. Unfortunately I can't change the actual mechanism of file i/o, but I have to pre-allocate space for the file to avoid fragmentation (which is killing our performance during reads). Is there a better way to do this than to actually write zeros or random data to the file? I know the ultimate size of the file when I'm opening it.
I know I can use fallocate on linux, but I don't know what the windows equivalent is.
Thanks!
Programmatically, on Windows you have to use Win32 API functions to do this:
SetFilePointerEx() followed by SetEndOfFile()
You can use these functions to pre-allocate the clusters for the file and avoid fragmentation. This works much more efficiently than pre-writing data to the file. Do this prior to doing your fopen().
If you want to avoid the Win32 API altogether, you can also do it non-programmatically using the system() function to issue the following command:
fsutil file createnew filename filesize
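For instance (a minimal sketch; the file name and size are placeholders, and fsutil requires that the file not already exist):
#include <cstdlib>

// create a 1 GiB file named preallocated.dat in the current directory
std::system("fsutil file createnew preallocated.dat 1073741824");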
You can use the SetFileValidData function to extend the logical length of a file without having to write out all that data to disk. However, because it can expose disk data that you would not otherwise be privileged to read, it requires the SE_MANAGE_VOLUME_NAME privilege to use. Read the Remarks section of the documentation carefully.
I'd recommend instead just writing out the 0's. You can also use SetFilePointerEx and SetEndOfFile to extend the file, but doing so still requires writing out zeros to disk (unless the file is sparse, but that defeats the point of reserving disk space). See Why does my single-byte write take forever? for more info on that.
Sample code; note that it isn't necessarily faster, especially with smart filesystems like NTFS.
HANDLE handle = CreateFile(fileName, GENERIC_WRITE, 0, 0, CREATE_ALWAYS,
                           FILE_FLAG_SEQUENTIAL_SCAN, NULL);
if (INVALID_HANDLE_VALUE != handle) {
    // preallocate a 2 GB disk file
    LARGE_INTEGER size;
    size.QuadPart = 2048LL * 0x100000;              // 2048 * 1 MB = 2 GB
    ::SetFilePointerEx(handle, size, 0, FILE_BEGIN);
    ::SetEndOfFile(handle);
    ::SetFilePointer(handle, 0, 0, FILE_BEGIN);     // rewind before writing
}
You could use the _chsize() function.
Check out this example on Code Project. It looks pretty straightforward to set the file size when the file is initially created.
http://www.codeproject.com/Questions/172979/How-to-create-a-fixed-size-file.aspx
FILE *fp = fopen("C:\\myimage.jpg", "ab");
fseek(fp, 0, SEEK_END);
long size = ftell(fp);                              // current file size
char *buffer = (char*)calloc(500 * 1024 - size, 1); // zero padding up to 500 KB
fwrite(buffer, 500 * 1024 - size, 1, fp);
free(buffer);
fclose(fp);
The following article from Raymond may help:
How can I preallocate disk space for a file without it being reported as readable?
Use the SetFileInformationByHandle function, passing the function code FileAllocationInfo and a FILE_ALLOCATION_INFO structure. "Note that this will decrease fragmentation, but because each write is still updating the file size there will still be synchronization and metadata overhead caused at each append."
The effect of setting the file allocation info lasts only as long as you keep the file handle open. When you close the file handle, all the preallocated space that you didn't use will be freed.
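A minimal sketch of that approach, with a placeholder file name and no error handling (SetFileInformationByHandle is available from Windows Vista onwards):
#include <windows.h>

HANDLE h = CreateFileA("preallocated.dat", GENERIC_WRITE, 0, NULL,
                       CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
FILE_ALLOCATION_INFO info = {};
info.AllocationSize.QuadPart = 2048LL * 1024 * 1024;     // reserve 2 GB worth of clusters
SetFileInformationByHandle(h, FileAllocationInfo, &info, sizeof(info));
// ... do the appends through this handle; the reservation lasts only while it stays open ...
CloseHandle(h);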

Which is faster, writing raw data to a drive, or writing to a file?

I need to write data to a drive. I have two options:
write raw sectors (_write(handle, pBuffer, size);)
write into a file (fwrite(pBuffer, size, count, pFile);)
Which way is faster?
I expected the raw sector write, _write, to be more efficient, but my test showed the opposite: fwrite is faster and _write takes longer.
I've pasted my snippet below; maybe my code is wrong, so can you help me out? Either way is okay by me, but I think raw writing is better, because at least that way the data on the drive seems to be encrypted....
#define SSD_SECTOR_SIZE 512
char szMemZero[SSD_SECTOR_SIZE] = {0};   // zero-filled buffer (declaration not shown in the original snippet)
unsigned long ulMovePointer = 0;

int g_pSddDevHandle = _open("\\\\.\\G:", _O_RDWR | _O_BINARY, _S_IREAD | _S_IWRITE);
TIMER_START();
while (ulMovePointer < 1024 * 1024 * 1024)
{
    _write(g_pSddDevHandle, szMemZero, SSD_SECTOR_SIZE);
    ulMovePointer += SSD_SECTOR_SIZE;
}
TIMER_END();
TIMER_PRINT();

ulMovePointer = 0;                       // reset the counter so the second loop writes the same amount
FILE *file = fopen("f:\\test.tmp", "a+");
TIMER_START();
while (ulMovePointer < 1024 * 1024 * 1024)
{
    fwrite(szMemZero, SSD_SECTOR_SIZE, 1, file);
    ulMovePointer += SSD_SECTOR_SIZE;
}
TIMER_END();
TIMER_PRINT();
Probably because a direct write isn't buffered. When you call fwrite, you are doing buffered writes, which tend to be faster in most situations. Essentially, each FILE* handle has an internal buffer which is flushed to disk when it becomes full, which means you end up making fewer system calls, as you only write to disk in larger chunks.
To put it another way, in your first loop, you are actually writing SSD_SECTOR_SIZE bytes to disk during each iteration. In your second loop you are not. You are only writing SSD_SECTOR_SIZE bytes to a memory buffer, which, depending on the size of the buffer, will only be flushed every Nth iteration.
In the _write() case, the value of SSD_SECTOR_SIZE matters. In the fwrite case, the size of each write will actually be BUFSIZ. To get a better comparison, make sure the underlying buffer sizes are the same.
However, this is probably only part of the difference.
In the fwrite case, you are measuring how fast you can get data into memory. You haven't flushed the stdio buffer to the operating system, and you haven't asked the operating system to flush its buffers to physical storage. To compare more accurately, you should call fflush() before stopping the timers.
If you actually care about getting data onto the disk rather than just getting the data into the operating systems buffers, you should ensure that you call fsync()/FlushFileBuffers() before stopping the timer.
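Concretely, the end of the fwrite timing loop from the question might look like this with the flushes added (a sketch that reuses the question's file pointer and timer macros; _commit comes from <io.h> in the Microsoft CRT):
#include <io.h>      // _commit
#include <stdio.h>   // fflush, _fileno

// ... fwrite loop as in the question ...
fflush(file);                 // push the stdio buffer down to the operating system
_commit(_fileno(file));       // ask the operating system to write its cached data to the device
TIMER_END();
TIMER_PRINT();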
Other obvious differences:
The drives are different. I don't know how different.
The semantics of a write to a device are different to the semantics of writes to a filesystem; the file system is allowed to delay writes to improve performance until explicitly told not to (eg. with a standard handle, a call to FlushFileBuffers()); writes directly to a device aren't necessarily optimised in that way. On the other hand, the file system must do extra I/O to manage metadata (block allocation, directory entries, etc.)
I suspect that you're seeing a difference in policy about how fast things actually get onto the disk. Raw disk performance can be very fast, but you need big writes and preferably multiple concurrent outstanding operations. You can also avoid buffer copying by using the right options when you open the handle.

When to build your own buffer system for I/O (C++)?

I have to deal with very large text files (2 GB); it is mandatory to read/write them line by line. Writing 23 million lines using ofstream is really slow, so at the beginning I tried to speed up the process by writing large chunks of lines into a memory buffer (for example 256 MB or 512 MB) and then writing the buffer to the file. This did not work; the performance is more or less the same. I have the same problem reading the files. I know the I/O operations are buffered by the STL I/O system, and this also depends on the disk scheduler policy (managed by the OS, in my case Linux).
Any idea about how to improve the performance?
PS: I have been thinking about using a background child process (or a thread) to read/write the data chunks while the program is processing data, but I do not know (mainly in the case of the subprocess) whether it will be worthwhile.
A 2GB file is pretty big, and you need to be aware of all the possible areas that can act as bottlenecks:
The HDD itself
The HDD interface (IDE/SATA/RAID/USB?)
Operating system/filesystem
C/C++ Library
Your code
I'd start by doing some measurements:
How long does your code take to read/write a 2GB file,
How fast can the 'dd' command read and write to disk? Example...
dd if=/dev/zero bs=1024 count=2000000 of=file_2GB
How long does it take to write/read using just big fwrite()/fread() calls
Assuming your disk is capable of reading/writing at about 40 MB/s (which is probably a realistic figure to start from), your 2 GB file can't be read or written in much less than about 50 seconds.
How long is it actually taking?
Hi Roddy, using the fstream read method with 1.1 GB files and large buffers (128, 255 or 512 MB) it takes about 43-48 seconds, and it is the same using fstream getline (line by line). cp takes almost 2 minutes to copy the file.
In which case, you're hardware-bound. cp has to read and write, and will be seeking back and forth across the disk surface like mad while doing it. So it will (as you see) be more than twice as slow as the simple 'read' case.
To improve the speed, the first thing I'd try is a faster hard drive, or an SSD.
You haven't said what the disk interface is. SATA is pretty much the easiest/fastest option. Also (obvious point, this...) make sure the disk is physically on the same machine your code is running on, otherwise you're network-bound...
I would also suggest memory-mapped files but if you're going to use boost I think boost::iostreams::mapped_file is a better match than boost::interprocess.
Maybe you should look into memory mapped files.
Check them in this library : Boost.Interprocess
Just a thought, but avoid using std::endl as this will force a flush before the buffer is full. Use '\n' instead for a newline.
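For example (out is a hypothetical std::ofstream and line a line being written):
out << line << '\n';           // newline only, no forced flush
// out << line << std::endl;   // would flush the buffer on every line, which is much slower here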
Don't use new to allocate the buffer like that:
Try: std::vector<>
unsigned int buffer_size = 64 * 1024 * 1024; // 64 MB for instance.
std::vector<char> data_buffer(buffer_size);
_file->read(&data_buffer[0], buffer_size);
Also read up on the rules for underscores in identifier names; your code is OK here, but some identifiers beginning with an underscore are reserved for the implementation.
Using getline() may be inefficient because the string buffer may need to be re-sized several times as data is appended to it from the stream buffer. You can make this more efficient by pre-sizing the string:
You can also set the size of the iostream's buffer: either very large, or NULL (for unbuffered):
// Unbuffered access:
fstream file;
file.rdbuf()->pubsetbuf(NULL, 0);                // set the buffer before open()
file.open("PLOP");

// Or, with a larger buffer:
std::vector<char> buffer(64 * 1024 * 1024);
fstream file;
file.rdbuf()->pubsetbuf(&buffer[0], buffer.size());
file.open("PLOP");

std::string line;
line.reserve(64 * 1024 * 1024);                  // pre-size the string (see above)
while (getline(file, line))
{
    // Do Stuff.
}
If you are going to buffer the file yourself, then I'd advise some testing using unbuffered I/O (setvbuf on a file that you've fopened can turn off the library buffering).
Basically, if you are going to buffer yourself, you want to disable the library's buffering, as it's only going to cause you pain. I don't know if there is any way to do that for STL I/O, so I recommend going down to the C-level I/O.
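A minimal sketch of that, with a placeholder file name (setvbuf must be called before any other operation on the stream):
#include <cstdio>

std::FILE* fp = std::fopen("data.bin", "rb");
if (fp)
{
    std::setvbuf(fp, nullptr, _IONBF, 0);   // no library buffering; reads go straight to the OS
    // ... do your own block-sized reads with std::fread here ...
    std::fclose(fp);
}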