How does fstream work? Memory or file?

I can't find a clear answer about one aspect of fstream that I need in order to decide whether it's worth using. Does fstream store its contents in memory, or is it more like a pointer to a location in a file? I was originally using CFile and reading the text into a CString, but I'd rather not have the entire file in memory if I can avoid it.

fstream is short for file stream -- it's normally a connection to a file in the host OS's file system. (§27.9.1.1/1: "The class basic_filebuf<charT,traits> associates both the input sequence and the output sequence with a file.")
It does (normally) buffer some information from that file, and if you happen to be working with a tiny file, it might all happen to fit in the buffer. In a typical case, however, most of the data will be in a file on disk (or at least in the OS's file cache) with some relatively small portion of it (typically a few kilobytes) in the fstream's buffer.
If you did want to use a buffer in memory and have it act like a file, you'd normally use a std::stringstream (or a variant like std::istringstream or std::ostringstream).
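For example, a minimal sketch of using std::stringstream as an in-memory stand-in for a file:
#include <sstream>
#include <string>

std::stringstream ss;          // the buffer lives entirely in memory
ss << "hello " << 42;          // write to it with the usual stream API
std::string word;
int number;
ss >> word >> number;          // read it back, as if from a file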

Related

mmap file opened with "fopen" in C++

After many frustrating experiences with limited support of HDF5 in many computers, I decided to write my own data container to store arrays in a binary file.
Basically, the format is very simple: each variable has a small header including a variable name, number of dimensions, actual size of each dimension and variable type. The data of one variable is stored right after the header. Variables are stored one after the other.
Read/write operations of header files are conveniently done using fseek, fread, fwrite and therefore I have opened the file using fopen, which returns a FILE*.
The problem is that if I want to update part of the values of one array on disk, the cleanest way to do it is using memory mapping (in my opinion). Looking at the documentation of mmap, it is possible to mmap files opened with "open", which returns an int file descriptor. But my file was already opened with "fopen".
Is it possible to mmap a section of a FILE*? How?
What you are looking for is the fileno function:
int fileno(FILE *stream);
The function fileno() examines the argument stream and returns its integer file descriptor.
You can call that on your file stream and pass the result to mmap. Either that or just use open instead of fopen to get a file descriptor in the first place since parsing the header via a memory map is probably easier than using fseek and fread.
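A minimal sketch of that approach, assuming a POSIX system (the file name data.bin is hypothetical, and error checks are elided):
#include <cstdio>
#include <sys/mman.h>

FILE* fp = fopen("data.bin", "r+");
fseek(fp, 0, SEEK_END);
long length = ftell(fp);                 // size of the file in bytes
void* map = mmap(nullptr, length, PROT_READ | PROT_WRITE, MAP_SHARED,
                 fileno(fp), 0);         // fileno() turns the FILE* into the descriptor mmap needs
// ... update array values in place through map (check for MAP_FAILED first) ...
munmap(map, length);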

How to read huge file in c++

Suppose I have a huge file (e.g. 1TB, or any size that does not fit into RAM; the file is stored on disk) that is delimited by spaces, and my RAM is only 8GB. Can I read that file with ifstream? If not, how do I read a block of the file (e.g. 4GB)?
There are a couple of things that you can do.
First, there's no problem opening a file that is larger than the amount of RAM that you have. What you won't be able to do is copy the whole file into memory at once. The best thing would be for you to find a way to read just a few chunks at a time and process them. You can use ifstream for that purpose (with ifstream.read, for instance). Allocate, say, one megabyte of memory, read the first megabyte of the file into it, rinse and repeat:
#include <cstddef>
#include <fstream>
#include <memory>

std::ifstream bigFile("mybigfile.dat", std::ios::binary);
constexpr std::size_t bufferSize = 1024 * 1024; // one megabyte at a time
std::unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
    bigFile.read(buffer.get(), bufferSize);
    std::streamsize bytesRead = bigFile.gcount(); // the last read may be shorter
    // process bytesRead bytes of data in buffer
}
Another solution is to map the file to memory. Most operating systems will allow you to map a file to memory even if it is larger than the physical amount of memory that you have. This works because the operating system knows that each memory page associated with the file can be mapped and unmapped on-demand: when your program needs a specific page, the OS will read it from the file into your process's memory and swap out a page that hasn't been used in a while.
However, this can only work if the file is smaller than the maximum amount of memory that your process can theoretically use. This isn't an issue with a 1TB file in a 64-bit process, but it wouldn't work in a 32-bit process.
Also be aware of the spirits that you're summoning. Memory-mapping a file is not the same thing as reading from it. If the file is suddenly truncated by another program, your program is likely to crash. If you modify the data, it's possible that you will run out of memory if you can't save back to the disk. Also, your operating system's algorithm for paging memory in and out may not work to your advantage. Because of these uncertainties, I would consider mapping the file only if reading it in chunks using the first solution cannot work.
On Linux/OS X, you would use mmap for it. On Windows, you would open a file and then use CreateFileMapping then MapViewOfFile.
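A minimal sketch of the Linux/OS X route (same hypothetical file name as above; error handling is trimmed):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open("mybigfile.dat", O_RDONLY);
struct stat st;
fstat(fd, &st);                          // query the file size
char* data = static_cast<char*>(
    mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
if (data != MAP_FAILED) {
    // pages of data[0 .. st.st_size) are faulted in lazily as you touch them
    munmap(data, st.st_size);
}
close(fd);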
I am sure you don't have to keep the whole file in memory. Typically one wants to read and process a file in chunks. If you want to use ifstream, you can do something like this:
#include <fstream>

std::ifstream is("/path/to/file", std::ios::binary);
char buf[4096];
while (is) {
    is.read(buf, sizeof(buf));
    process_chunk(buf, is.gcount()); // gcount() is the number of bytes actually read
}
A more advanced approach is, instead of reading the whole file or its chunks into memory, to map it into memory using platform-specific APIs:
Under Windows: CreateFileMapping(), MapViewOfFile()
Under Linux: open(2) / creat(2), shm_open, mmap
You will need to compile a 64-bit app to make it work.
For more details see here: CreateFileMapping, MapViewOfFile, how to avoid holding up the system memory
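A minimal sketch of the Windows side, assuming a read-only mapping (the file name is hypothetical and error checks are elided):
#include <windows.h>

HANDLE file = CreateFileA("mybigfile.dat", GENERIC_READ, FILE_SHARE_READ,
                          nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
const char* data = static_cast<const char*>(
    MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0)); // the three zeros map the whole file
// ... read from data as a flat array ...
UnmapViewOfFile(data);
CloseHandle(mapping);
CloseHandle(file);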
You can use fread:
char buffer[4096];
size_t bytesRead = fread(buffer, sizeof(char), sizeof(buffer), fp); // fp is a FILE* from fopen(..., "rb"); returns the element count read
Or, if you want to use C++ fstreams, you can use read as buratino said.
Also keep in mind that you can open a file regardless of its size; the idea is to open it and read it in chunks that fit in your RAM.

Recover object lazy-loading the containing file

I'm recovering an object from a binary file using boost::archive::binary_iarchive, but the file is too heavy (18GB) and loading that object reads the entire file into memory. Is there a way to read the file in parts (a lazy load) to avoid the memory use?
What I have:
std::ifstream ifs(filename, std::ios::binary);
boost::archive::binary_iarchive ia(ifs);
MyObject obj;
ia >> obj;
Upgrading my comment to an answer:
@cmaster got really close to an approach that can work, but he accidentally put the problem upside down.
The raw file was never the issue (it was streaming all along).
The problem is that deserialization tries to put all the data in memory (the vector, e.g.). So the only real solutions would be to either:
put this data into a (shared?) memory map. You can use the allocators from Boost Interprocess to help you achieve this (a sketch follows below). This is a lot of effort, but relatively straightforward, conceptually.
modify the deserialization code to convert to a different on-disk format on the fly (instead of inserting into e.g. that vector), which would then allow mmap as cmaster suggested.
In other words, you'd "cannibalize" the boost serialization implementation to migrate the data away from boost serialization towards a raw binary format that affords using it directly in mapped memory.
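A minimal sketch of the first option using Boost.Interprocess (the names data.bin and my_vector, and the 1GB segment size, are hypothetical):
#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/interprocess/containers/vector.hpp>

namespace bip = boost::interprocess;
using Alloc = bip::allocator<double, bip::managed_mapped_file::segment_manager>;
using MappedVector = bip::vector<double, Alloc>;

// back the container with a file on disk instead of the heap
bip::managed_mapped_file mfile(bip::open_or_create, "data.bin", 1ull << 30);
MappedVector* v = mfile.find_or_construct<MappedVector>("my_vector")(
    mfile.get_segment_manager());
v->push_back(3.14); // lives in the mapped file; the OS pages it in and out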
You can use mmap() to map the file into your address space. With that, it doesn't matter that the file is too large because the kernel knows that any data in the mapped region is just a copy of the file on the hard disk. Consequently, it does not even need to swap the data out when it needs the memory for something else. The kernel will just lazily load the parts of the file that you need as you touch them, which is especially good if you don't need everything in the file.
The nice thing about mmap() is that you have the entire file contents accessible as a huge char array, which is quite convenient for many use cases. The only precondition that must be met is that your process runs as a 64-bit process, otherwise your virtual address space will be too small to fit the file into it.
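For instance, on a POSIX system (a minimal sketch; archive.bin stands in for the real file, and error checks are elided):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open("archive.bin", O_RDONLY);
struct stat st;
fstat(fd, &st); // st.st_size holds the 18GB size
const char* bytes = static_cast<const char*>(
    mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
// bytes[0 .. st.st_size) now reads like one huge in-memory array,
// but each page is only loaded from disk when it is first touched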

std::ofstream::open: will it read the entire file into memory?

I'm writing things from my memory to the disk in order to free my memory.
I wonder: each time I call open() and append new elements to the end of the file, will it read the entire file into memory, or is it just a pointer to the end of the file?
The fstream specification doesn't say exactly what happens if you use the ofstream::app, ios::app, ofstream::ate or ios::ate mode to open the file.
But in any sane implementation, the file is not read into memory, all that happens is that the fstream implementation positions the "current position" to the end of the file.
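For example, a minimal sketch of appending (mylog.txt is a hypothetical file):
#include <fstream>

std::ofstream out("mylog.txt", std::ios::app); // positions at the end; nothing is read in
out << "another entry\n";                      // each write lands at the end of the file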
To read the entire file into memory would be rather terrible if you have a system with 2GB of RAM and you wanted to append to a file that is bigger than 2GB.
Being very pedantic, when writing something to a text-file, it is likely that the filesystem that is part of the operating system will read the last few (kilo)bytes of the file, as most hard-disks and similar storage requires that the data is being written to a "block", which is a fixed size (e.g. 512 bytes or 4 kilobytes). So, unless the current filesize is exactly at a boundary of such a block, the filesystem must read the last block of the file and write it back with the additional data that you asked to write.
If you are worried about appending to a log-file that gets very large, no, it's not an issue. If you are worried about memory safety because your file holds secret data that you don't want stored in memory, then it may be a problem, because a portion of the file will probably be loaded into memory, and there is nothing you can do to control that.

c++ fstream and seekp

If I do a seekp past the end of a file and then a write, what happens? What I would like to happen is that the file is automatically extended to the size of the seekp location. If so, does it also fill in the space between the old end of file and the new location by writing anything to the disk, or is the space just allocated?
However, if it does not extend the file, how could I accomplish this fairly efficiently?
It is perfectly acceptable to seekp() past the end of file and then write. The file would indeed be extended.
Whether or not there would be disk space allocated for the hole depends on the filesystem. Some filesystems (e.g. ext3) support sparse files, some don't.
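A minimal sketch (the file name and the 100MB offset are just an illustration):
#include <fstream>

std::ofstream out("sparse.dat", std::ios::binary);
out.seekp(100 * 1024 * 1024);  // seek 100MB past the end of the (empty) file
out.write("X", 1);             // this write extends the file to 100MB + 1 byte
// on a filesystem with sparse-file support the hole occupies no disk blocks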