Reading large files in C++

Suppose I have a .txt file the size of 9 GB, and I only want to read the n'th MB (I know n). But my computer has only 4 GB of RAM, so I can't load the whole file at once. I need to access different n's multiple times. What's the best way to do this? (I don't know if the standard ifstream is able to do this.)

You want to "seek" in the file to a specified location. In C++ using ifstream you use seekg(): http://www.cplusplus.com/reference/istream/istream/seekg/
For example:
char data[1024*1024];
ifstream in("myfile.txt", ios::binary);
in.seekg(450 * 1024 * 1024, ios_base::beg); // skip the first 450 MB
if (in.read(data, sizeof(data))) {
    // use data
}
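If n varies at run time, the same idea can be wrapped in a small helper. A minimal sketch (the name readNthMB, the file path, and the 1 MiB chunk size are just illustrative); std::streamoff is typically 64-bit, so offsets beyond 2 GB are fine:
#include <fstream>
#include <string>
#include <vector>

// Sketch: read the n'th MiB of a large file without loading the rest.
std::vector<char> readNthMB(const std::string& path, std::streamoff n)
{
    const std::streamoff MB = 1024 * 1024;
    std::ifstream in(path, std::ios::binary);
    in.seekg(n * MB, std::ios::beg);                 // jump straight to the n'th MiB
    std::vector<char> buf(static_cast<size_t>(MB));
    in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    buf.resize(static_cast<size_t>(in.gcount()));    // keep only what was actually read
    return buf;
}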

Is your OS 64-bit? If so, try mmap().
On modern operating systems, it is possible to mmap (pronounced “em-map”) a file to a region of memory. When this is done, the file can be accessed just like an array in the program.
This is more efficient than read or write, as only the regions of the file that a program actually accesses are loaded. Accesses to not-yet-loaded parts of the mmapped region are handled in the same way as swapped-out pages.
Since mmapped pages can be stored back to their file when physical memory is low, it is possible to mmap files orders of magnitude larger than both the physical memory and swap space. The only limit is address space. The theoretical limit is 4 GB on a 32-bit machine; however, the actual limit will be smaller since some areas will be reserved for other purposes. If the LFS interface is used, the file size on 32-bit systems is not limited to 2 GB (offsets are signed, which reduces the addressable area of 4 GB by half); the full 64 bits are available.
Memory mapping only works on entire pages of memory. Thus, addresses for mapping must be page-aligned, and length values will be rounded up.
More info:
https://www.safaribooksonline.com/library/view/linux-system-programming/0596009585/ch04s03.html
https://www.gnu.org/software/libc/manual/html_node/Memory_002dmapped-I_002fO.html
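A minimal POSIX sketch of the idea, assuming a Linux/macOS system; the file name and the offset are illustrative and most error handling is omitted:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("myfile.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);                                  // file size, needed for mmap/munmap

    // Map the whole file read-only; pages are loaded lazily on first access.
    char* data = static_cast<char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // Access the n'th MiB as if the file were an in-memory array.
    const size_t n = 450;                            // illustrative
    std::printf("%d\n", data[n * 1024 * 1024]);      // page fault loads just this region

    munmap(data, st.st_size);
    close(fd);
}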

Related

Can I create an array exceeding RAM, if I have enough swap memory?

Let's say I have 8 Gigabytes of RAM and 16 Gigabytes of swap memory. Can I allocate a 20 Gigabyte array there in C? If yes, how is it possible? What would that memory layout look like?
Yes, you can. Note that accessing swap is veerry slooww.
how is it possible
Allocate dynamic memory. The operating system handles the rest.
What would that memory layout look like?
On an amd64 system, you can have 256 TiB of address space. You can easily fit a contiguous block of 20 GiB in that space. The operating system divides the virtual memory into pages and copies the pages between physical memory and swap space as needed.
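As a sketch of that (sizes taken from the question, a 64-bit Linux system assumed): the allocation itself only reserves address space, and pages migrate between RAM and swap as you touch them.
#include <cstdio>
#include <cstdlib>

int main()
{
    const size_t bytes = 20ULL * 1024 * 1024 * 1024;   // the 20 GiB array from the question
    char* a = static_cast<char*>(std::malloc(bytes));
    if (!a) { std::puts("allocation failed"); return 1; }

    a[0] = 1;                  // pages are materialized only when touched
    a[bytes - 1] = 2;          // later on, cold pages like this may live in swap
    std::printf("%d %d\n", a[0], a[bytes - 1]);
    std::free(a);
}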
Modern operating systems use virtual memory. In Linux and most other OSes, each process has its own address space according to the abilities of the architecture. You can check the size of the virtual address space in /proc/cpuinfo. For example, you may see:
address sizes : 43 bits physical, 48 bits virtual
This means that virtual addresses use 48 bits. Half of that is reserved for the kernel, so you can only use 47 bits, or 128 TiB. Any memory you allocate will be placed somewhere in those 128 TiB of address space as if you actually had that much memory.
Linux uses demand paging and, by default, overcommits memory. When you say
char *mem = (char*)malloc(1'000'000'000'000);
what happens is that Linux picks a suitable address and just records that you have allocated 1'000'000'000'000 bytes (rounded up to the nearest page) of memory starting at that point. (It does some sanity checks that the amount isn't totally bonkers, depending on the amount of physical memory that is free, the amount of swap that is free, and the overcommit setting. By default you can allocate a lot more than you have memory and swap.)
Note that at this point no physical memory and no swap space is connected to your allocated block at all. This changes when you first write to the memory:
mem[4096] = 0;
At this point the program will page fault. Linux checks that the address is actually something your program is allowed to write to, finds a physical page, and maps it at &mem[4096]. Then it lets the program retry the write and everything continues.
If Linux can't find a physical page, it will try to swap something out to make a physical page available for your program. If that also fails, your program will receive a SIGSEGV and likely die.
As a result you can allocate basically unlimited amounts of memory, as long as you never write to more than the physical memory and swap can support. On the other hand, if you initialize the memory (explicitly, or implicitly using calloc()), the system will quickly notice if you try to use more than is available.
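Putting the fragments above together as one runnable sketch (64-bit Linux with default overcommit settings assumed):
#include <cstdio>
#include <cstdlib>

int main()
{
    // Reserve ~1 TB of address space; no physical memory or swap is used yet.
    const size_t bytes = 1'000'000'000'000;
    char* mem = static_cast<char*>(std::malloc(bytes));
    if (!mem) { std::puts("malloc refused the request"); return 1; }

    // Each first write page-faults; only then is a physical page mapped in.
    mem[4096] = 0;
    mem[bytes - 1] = 0;

    std::printf("touched 2 pages of a %zu-byte allocation\n", bytes);
    std::free(mem);
}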
You can, but not with a simple malloc. It's platform-dependent.
It requires an OS call to allocate swappable memory (it's VirtualAlloc on Windows, for example; on Linux it should be mmap and related functions).
Once that's done, the allocated memory is divided into pages, contiguous blocks of fixed size. You can lock a page so that it is loaded into RAM and you can read and modify it freely. For old dinosaurs like me, it's exactly how EMS memory worked under DOS... You address your swappable memory with a kind of segment:offset method: first, you divide your linear address by the page size to find which page is needed, then you use the remainder to get the offset within that page.
Once unlocked, the page remains in memory until the OS needs memory: then, an unlocked page will be flushed to disk, into swap, and discarded from RAM... until you lock (and load...) it again. But this operation may require freeing RAM, so another process may have its unlocked pages swapped out BEFORE your own page is loaded again. And this is damn SLOOOOOOW... even on an SSD!
So, it's not always a good thing to use swap. A better way is to use memory-mapped files - perfect for reading very big files mostly sequentially, with few random accesses - if that suits your needs.
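On Linux, the calls described above would look roughly like this (a sketch; the 20 GiB size and MAP_NORESERVE are illustrative, and mlock() may need a raised RLIMIT_MEMLOCK):
#include <sys/mman.h>
#include <cstdio>

int main()
{
    const size_t bytes = 20ULL * 1024 * 1024 * 1024;   // 20 GiB of swappable memory

    // Anonymous mapping: backed by RAM and swap, no file involved.
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // "Lock" one page into RAM, roughly analogous to VirtualLock on Windows.
    if (mlock(p, 4096) != 0) perror("mlock");          // may fail without raised limits

    static_cast<char*>(p)[0] = 1;                      // use the memory
    munlock(p, 4096);
    munmap(p, bytes);
}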

How to read a huge file in C++

Suppose I have a huge file (e.g. 1 TB, or any size that does not fit into RAM; the file is stored on disk), delimited by spaces, and my RAM is only 8 GB. Can I read that file with ifstream? If not, how do I read a block of the file (e.g. 4 GB)?
There are a couple of things that you can do.
First, there's no problem opening a file that is larger than the amount of RAM that you have. What you won't be able to do is copy the whole file live into your memory. The best thing would be for you to find a way to read just a few chunks at a time and process them. You can use ifstream for that purpose (with ifstream.read, for instance). Allocate, say, one megabyte of memory, read the first megabyte of that file into it, rinse and repeat:
ifstream bigFile("mybigfile.dat", ios::binary);
constexpr size_t bufferSize = 1024 * 1024;
unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
    bigFile.read(buffer.get(), bufferSize);
    // process bigFile.gcount() bytes of data in buffer; the last chunk may be short
}
Another solution is to map the file to memory. Most operating systems will allow you to map a file to memory even if it is larger than the physical amount of memory that you have. This works because the operating system knows that each memory page associated with the file can be mapped and unmapped on-demand: when your program needs a specific page, the OS will read it from the file into your process's memory and swap out a page that hasn't been used in a while.
However, this can only work if the file is smaller than the maximum amount of memory that your process can theoretically use. This isn't an issue with a 1TB file in a 64-bit process, but it wouldn't work in a 32-bit process.
Also be aware of the spirits that you're summoning. Memory-mapping a file is not the same thing as reading from it. If the file is suddenly truncated from another program, your program is likely to crash. If you modify the data, it's possible that you will run out of memory if you can't save back to the disk. Also, your operating system's algorithm for paging in and out memory may not behave in a way that advantages you significantly. Because of these uncertainties, I would consider mapping the file only if reading it in chunks using the first solution cannot work.
On Linux/OS X, you would use mmap for it. On Windows, you would open a file and then use CreateFileMapping then MapViewOfFile.
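For reference, a pared-down Windows sketch of that second approach (the file name is illustrative, error handling is omitted, and a 64-bit build is assumed so a 1 TB view fits in the address space):
#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE file = CreateFileA("hugefile.dat", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);

    // Map the entire file; pages are read from disk only when actually touched.
    const char* data = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

    std::printf("first byte: %d\n", data[0]);

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
}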
I am sure you don't have to keep the whole file in memory. Typically one wants to read and process a file in chunks. If you want to use ifstream, you can do something like this:
ifstream is("/path/to/file");
char buf[4096];
do {
    is.read(buf, sizeof(buf));
    process_chunk(buf, is.gcount());
} while (is);
A more advanced approach, instead of reading the whole file or its chunks into memory, is to map it into memory using platform-specific APIs:
Under Windows: CreateFileMapping(), MapViewOfFile()
Under Linux: open(2) / creat(2), shm_open, mmap
You will need to compile a 64-bit app to make it work.
For more details see here: CreateFileMapping, MapViewOfFile, how to avoid holding up the system memory
You can use fread:
char buffer[4096];
FILE *fp = fopen("mybigfile.dat", "rb");             // illustrative file name
size_t got = fread(buffer, 1, sizeof(buffer), fp);   // got = number of bytes actually read
Or, if you want to use C++ fstreams you can use read as buratino said.
Also keep in mind that you can open a file regardless of its size; the idea is to open it and read it in chunks that fit in your RAM.

How to read blocks of data from a file and then read from that block into a vector?

Suppose I have a file which has x records. One 'block' holds m records. The total number of blocks in the file is n = x/m. If I know the size of one record, say b bytes (so the size of one block is b*m), I can read a complete block at once using the system call read() (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector?
The reason why I want to do this in the first place is to reduce disk I/O operations, as disk I/O operations are much more expensive, according to what I have learned.
Or will it take the same amount of time as when I read record by record from the file and put each directly into the vector, instead of reading block by block? Reading block by block, I will have only n disk I/Os, whereas reading record by record gives x I/Os.
Thanks.
You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that the file contents are simply mapped into your process space as if you already had a pointer to them. By simply inspecting memory contents and treating them as an array, or by copying data using memcpy(), you will implicitly perform read operations, but only as necessary - the operating system's virtual memory subsystem is smart enough to do it very efficiently.
The only possible reason to avoid mmap may be if you are running on a 32-bit OS and the file size exceeds 2 gigabytes (or slightly less than that). In this case the OS may have trouble allocating address space for your mmap-ed memory. But on a 64-bit OS, using mmap should never be a problem.
Also, mmap can be cumbersome if you are writing a lot of data and the size of the data is not known upfront. Other than that, it is always better and faster to use it than read().
Actually, most modern operating systems rely on mmap extensively. For example, in Linux, to execute some binary, your executable is simply mmap-ed and executed from memory as if it was copied there by read, without actually reading it.
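As a concrete sketch of that idea (the Record layout, block size m, and file name are placeholders; a real program would check every call for errors):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>
#include <vector>

struct Record { char payload[64]; };          // b = 64 bytes per record, purely illustrative

int main()
{
    int fd = open("records.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    const char* base = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    // Copy one block of m records into a vector; the OS reads only the pages touched.
    const size_t m = 1000;                    // records per block
    const size_t blockIndex = 3;              // which block we want
    std::vector<Record> block(m);
    std::memcpy(block.data(), base + blockIndex * m * sizeof(Record),
                m * sizeof(Record));

    munmap(const_cast<char*>(base), st.st_size);
    close(fd);
}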
Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible that reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same thing by simply adjusting the size of the buffer used by the stream (which is probably a lot easier).
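For example, the stream's buffer can be enlarged before the file is opened (with common implementations pubsetbuf must be called before any I/O, hence before open()); the 4 MiB size and file name are illustrative:
#include <fstream>
#include <vector>

int main()
{
    std::vector<char> buf(4 * 1024 * 1024);            // 4 MiB stream buffer

    std::ifstream in;
    in.rdbuf()->pubsetbuf(buf.data(),
                          static_cast<std::streamsize>(buf.size()));
    in.open("records.bin", std::ios::binary);

    char record[256];                                  // one fixed-size record
    while (in.read(record, sizeof(record))) {
        // push the record into a vector, etc.
    }
}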

why does dynamic memory allocation fail after 600MB?

I implemented a bloom filter (bit table) using a three-dimensional char array. It works well until it reaches a point where it can no longer allocate memory and throws bad_alloc. It gives me this error on the next expand request after allocating 600 MB.
The bloom filter (the array) is expected to grow as big as 8 to 10 GB.
Here is the code I used to allocate (expand) the bit table.
unsigned char ***bit_table_ = 0;
unsigned int ROWS_old = 5;
unsigned int EXPND_SIZE = 5;

void expand_bit_table()
{
    FILE *temp;
    temp = fopen("chunk_temp", "w+b");
    //copy old content
    for (int i = 0; i < ROWS_old; ++i)
        for (int j = 0; j < ROWS; ++j)
            fwrite(bit_table_[i][j], COLUMNS, 1, temp);
    fclose(temp);
    //delete old table
    chunk_delete_bit_table();
    //create expanded bit table ==> add EXPND_SIZE more rows
    bit_table_ = new unsigned char**[ROWS_old + EXPND_SIZE];
    for (int i = 0; i < ROWS_old + EXPND_SIZE; ++i)
    {
        bit_table_[i] = new unsigned char*[ROWS];
        for (int k = 0; k < ROWS; ++k)
            bit_table_[i][k] = new unsigned char[COLUMNS];
    }
    //copy back old content
    temp = fopen("chunk_temp", "r+b");
    for (int i = 0; i < ROWS_old; ++i)
    {
        fread(bit_table_[i], COLUMNS * ROWS, 1, temp);
    }
    fclose(temp);
    //set remaining content of bit_table_ to 0
    for (int i = ROWS_old; i < ROWS_old + EXPND_SIZE; ++i)
        for (int j = 0; j < ROWS; ++j)
            for (int k = 0; k < COLUMNS; ++k)
                bit_table_[i][j][k] = 0;
    ROWS_old += EXPND_SIZE;
}
What is the maximum allowable size for an array, and if this is not the issue, what can I do about it?
EDIT:
It was developed on a 32-bit platform.
It is run on a 64-bit platform (server) with 8 GB RAM.
A 32-bit program must allocate memory from its virtual address space, which also holds chunks of code and data; memory is allocated from the holes between them. Yes, the maximum you can hope for is around 650 megabytes, the size of the largest available hole, and it goes rapidly down from there. You can solve it by making your data structure smarter, like a tree or list instead of one giant array.
You can get more insight into the virtual memory map of your process with SysInternals' VMMap utility. You might be able to change the base address of a DLL so it doesn't sit plumb in the middle of an otherwise empty region of the address space. The odds that you'll get much beyond 650 MB are, however, poor.
There's a lot more breathing room on a 64-bit operating system: a 32-bit process has a 4 gigabyte address space there, since the operating system components run in 64-bit mode. You have to use the /LARGEADDRESSAWARE linker option to allow the process to use it all. Still, that only works on a 64-bit OS; your program is still likely to bomb on a 32-bit OS. When you really need that much VM, the simplest approach is to just make a 64-bit OS a prerequisite and build your program targeting x64.
A 32-bit machine gives you a 4GB address space.
The OS reserves some of this (half of it by default on Windows, giving you 2 GB to yourself; I'm not sure about Linux, but I believe it reserves 1 GB).
This means you have 2-3 GB for your own process.
Into this space, several things need to fit:
your executable (as well as all dynamically linked libraries) is memory-mapped into it
each thread needs a stack
the heap
and quite a few other nitty gritty bits.
The point is that it doesn't really matter how much memory you end up actually using. But a lot of different pieces have to fit into this memory space. And since they're not packed tightly into one end of it, they fragment the memory space. Imagine, for simplicity, that your executable is mapped into the middle of this memory space. That splits your 3GB into two 1.5GB chunks. Now say you load two dynamic libraries, and they subdivide those two chunks into four 750MB ones. Then you have a couple of threads, each needing further chunks of memory, splitting up the remaining areas further. Of course, in reality each of these won't be placed at the exact center of each contiguous block (that'd be a pretty stupid allocation strategy), but nevertheless, all these chunks of memory subdivide the available memory space, cutting it up into many smaller pieces.
You might have 600MB memory free, but you very likely won't have 600MB of contiguous memory available. So where a single 600MB allocation would almost certainly fail, six 100MB allocations may succeed.
There's no fixed limit on how big a chunk of memory you can allocate. The answer is "it depends". It depends on the precise layout of your process' memory space. But on a 32-bit machine, you're unlikely to be able to allocate 500MB or more in a single allocation.
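A quick way to see this in your own 32-bit process is to probe for the largest single block malloc will still hand out (a sketch; the numbers vary from run to run as the address-space layout changes):
#include <cstdio>
#include <cstdlib>

int main()
{
    size_t size = 2048u * 1024 * 1024;        // start near the 2 GB user-space limit
    while (size > 0) {
        void* p = std::malloc(size);
        if (p) {
            std::printf("largest single allocation: ~%zu MB\n", size / (1024 * 1024));
            std::free(p);
            break;
        }
        size -= 64 * 1024 * 1024;             // step down in 64 MB decrements
    }
}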
The maximum in-memory data a 32-bit process can access is 4 GB in theory (in practice it will be somewhat smaller). So you cannot have 10 GB of data in memory at once (even with the OS supporting more). Also, even though you are allocating the memory dynamically, the free store available is further limited by the stack size and other reserved regions.
The actual memory available to the process depends on the compiler settings used to generate the executable.
If you really do need that much, consider persisting (parts of) the data in the file system.

Create too large array in C++, how to solve?

Recently, I have been working in C++ and I have to create an array[60.000][60.000]. However, I cannot create this array because it's too large. I tried float **array and even a static float array, but nothing works. Does anyone have any ideas?
Thanks for your help!
A matrix of size 60,000 x 60,000 has 3,600,000,000 elements.
You're using type float so it becomes:
60,000 * 60,000 * 4 bytes = 14,400,000,000 bytes ≈ 13.4 GiB
Do you even have that much memory in your machine?
Note that the issue of stack vs heap doesn't even matter unless you have enough memory to begin with.
Here's a list of possible problems:
You don't have enough memory.
If the matrix is declared globally, you'll exceed the maximum size of the binary.
If the matrix is declared as a local array, then you will blow your stack.
If you're compiling for 32-bit, you have far exceeded the 2GB/4GB addressing limit.
Does "60.000" actually mean "60000"? If so, the size of the required memory is 60000 * 60000 * sizeof(float), which is roughly 13.4 GiB. A typical 32-bit process is limited to only 2 GB, so it is clear why it doesn't fit.
On the other hand, I don't see why you shouldn't be able to fit that into a 64-bit process, assuming your machine has enough RAM.
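For example, on a 64-bit build with enough RAM (or swap), the straightforward allocation is expected to work; a minimal sketch:
#include <cstddef>
#include <vector>

int main()
{
    const std::size_t n = 60000;
    // ~13.4 GiB of zero-initialized floats; needs a 64-bit process and enough RAM/swap.
    std::vector<float> matrix(n * n);
    matrix[59999 * n + 59999] = 1.0f;   // element [59999][59999] in row-major order
}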
Allocate the memory at runtime -- consider using a memory mapped file as the backing. Like everyone says, 14 gigs is a lot of memory. But it's not unreasonable to find a computer with 14GB of memory, nor is it unreasonable to page the memory as necessary.
With a matrix of this size, you will likely become very curious about memory access performance. Remember to consider the cache grain of your target architecture and if your target has a TLB you may be able to use larger pages to relieve some TLB pressure. Then again, if you don't have enough memory you'll likely care only about how fast your storage I/O is.
If it's not already obvious, you'll need an architecture that supports a 64-bit address space in order to access this memory directly/conveniently.
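A POSIX sketch of the memory-mapped-file backing mentioned above (file name illustrative, error handling trimmed): the 13.4 GiB live in the file, and only the pages you touch occupy RAM.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

int main()
{
    const std::size_t N = 60000;
    const std::size_t bytes = N * N * sizeof(float);      // ~13.4 GiB backing file

    int fd = open("matrix.bin", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, static_cast<off_t>(bytes));             // size the backing file

    float* m = static_cast<float*>(
        mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

    m[42 * N + 7] = 3.14f;      // element [42][7]; the dirty page is written back lazily

    munmap(m, bytes);
    close(fd);
}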
To initialise the 2D array of floats that you want, you will need:
60000 * 60000 * 4 bytes = 14400000000 bytes
That is approximately 14 GB of memory. That's a LOT of memory. To even hold that theoretically, you will need to be running a 64-bit machine, not to mention one with quite a bit of RAM installed.
Furthermore, allocating this much memory is rarely necessary; are you sure no optimisations could be made here?
EDIT:
In light of new information from your comments on other answers: you only have 4 GB of memory (RAM). Your operating system is hence going to have to page at least 9 GB to the hard drive, in reality probably more. But you also only have 20 GB of hard drive space. This is barely enough to page all that data, especially if the disk is fragmented. Finally (I could be wrong because you haven't stated it explicitly), it is quite possible that you're running a 32-bit machine, which isn't really capable of handling more than 4 GB of memory at a time.
I had this problem too. I did a workaround where I chopped the array into sections (my biggest allowed array was float A_sub_matrix_20[62944560]). When I declared just one of these in main(), it seemed to be put in RAM, as I got a runtime exception as soon as main() started. I was able to declare 20 buffers of that size as global variables, which works (it looks like in global form they are stored on the HDD - when I added A_sub_matrix_20[n] to the watch list in Visual Studio it gave the message "reading from file").