Calling fallocate() on a C++ ofstream

I am writing a large binary output buffer through ofstream::write(). Since I know the size of the output file, but sometimes have to write it in chunks, I thought it would be a good idea to call fallocate() (or posix_fallocate()) first to preallocate the buffer on disk. Those do, however, require a file descriptor, which ofstream does not provide me with.
Is there an ofstream interface for calling fallocate(), or possibly to get the underlying file descriptor so that I can call it myself? (Or is it not worth the bother?)

Since you are going to write in chunks, use fwrite.
Also see http://www.cplusplus.com/reference/cstdio/setvbuf/ to control the buffer size.
To optimize, you can make the buffer size N * the chunk size.
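There is no standard way to get the file descriptor out of an ofstream, so if preallocation matters, one option is to do the whole job through stdio. A minimal sketch, assuming a POSIX system and placeholder chunk/total sizes; posix_fallocate() takes the descriptor obtained with fileno():

#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <vector>

int main() {
    const std::size_t chunk_size = 1 << 20;            // assumed chunk size (1 MiB)
    const std::size_t total_size = 64 * chunk_size;    // assumed known output size
    std::vector<char> chunk(chunk_size, 0);            // stand-in for the real data

    FILE* f = std::fopen("out.bin", "wb");
    if (!f) return 1;

    // Give stdio a buffer that is a multiple of the chunk size, as suggested above.
    std::setvbuf(f, nullptr, _IOFBF, 4 * chunk_size);

    // Preallocate the whole file up front via the underlying descriptor.
    if (posix_fallocate(fileno(f), 0, total_size) != 0) {
        // Preallocation failed; the writes still work, just without the hint.
    }

    for (std::size_t written = 0; written < total_size; written += chunk_size)
        std::fwrite(chunk.data(), 1, chunk_size, f);

    std::fclose(f);
}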

Related

How can I read/write a fixed number of ints from/to file in one operation (as fast as I can, file can be assumed to be binary)?

I have a big file (let's assume I can make it binary) that cannot fit in RAM, and I want to sort the numbers in it. In the process I need to read/write a large amount of numbers from/to the file (from/to a vector<int> or int[]) quickly, so I'd rather not read/write them one by one but in blocks of a fixed size. How can I do that?
I have a big file (let's assume I can make it binary) that cannot fit in RAM, and I want to sort the numbers in it.
Given that the file is binary, perhaps the simplest, and presumably an efficient, solution is to memory map the file. Unfortunately there is no standard interface to perform memory mapping. On POSIX systems, there is the mmap function.
Now, the memory mapped file is simply an array of raw bytes. Treating it as an array of integers is technically not allowed until C++20, where C-style "implicit creation of low level objects" is introduced. In practice, that already works on most current language implementations (see note 1).
For this reinterpretation to work, the representation of the integers in the file must match the representation of integers used by the CPU. The file will not be portable to the same program running on other, incompatible systems.
We can simply use std::sort on this array. The operating system should take care of paging the file in and out of memory. The algorithm used by std::sort isn't necessarily optimised for this use case however. To find the optimal algorithm, you may need to do some research.
Note 1: In case pre-C++20 standard conformance is a concern, it is possible to iterate over the array, copy the underlying bytes into an integer, and placement-new an integer object into that memory using the copied value. A compiler can optimise these operations into zero instructions, which makes the program's behaviour well defined.
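As a rough sketch of the mmap-and-sort approach described above (POSIX only; "numbers.bin" is a hypothetical file holding raw int values in the machine's native representation):

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("numbers.bin", O_RDWR);              // hypothetical file of raw ints
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }
    std::size_t count = st.st_size / sizeof(int);

    // Map the whole file read/write; changes made through the pointer end up in the file.
    void* p = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    int* data = static_cast<int*>(p);                  // the reinterpretation discussed above
    std::sort(data, data + count);                     // the OS pages the file in and out

    munmap(p, st.st_size);
    close(fd);
}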
You can use ostream::write to write into a file and istream::read to read from a file.
To make the process clean, it is good to also store the number of items in the file.
Let's say you have a vector<int>.
You can use the following code to write its contents to a file.
#include <fstream>
#include <vector>

std::vector<int> myData;
// .. Fill up myData;
// Open a file to write to, in binary mode.
std::ofstream out("myData.bin", std::ofstream::binary);
// Write the size first.
auto size = myData.size();
out.write(reinterpret_cast<char const*>(&size), sizeof(size));
// Write the data.
out.write(reinterpret_cast<char const*>(myData.data()), sizeof(int)*size);
You can read the contents of such a file using the following code.
std::vector<int> myData;
// Open the file to read from, in binary mode.
std::ifstream in("myData.bin", std::ifstream::binary);
// Read the size first.
std::vector<int>::size_type size = 0;   // same type as the size that was written
in.read(reinterpret_cast<char*>(&size), sizeof(size));
// Resize myData so it has enough space to read into.
myData.resize(size);
// Read the data.
in.read(reinterpret_cast<char*>(myData.data()), sizeof(int)*size);
If not all of the data can fit into RAM, you can read and write it in smaller chunks. However, if you process it in smaller chunks, I don't know how you would sort it.
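For completeness, a rough sketch of reading the same size-prefixed file back in fixed-size chunks (the chunk size here is arbitrary):

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("myData.bin", std::ifstream::binary);

    // Read the element count written by the code above.
    std::vector<int>::size_type total = 0;
    in.read(reinterpret_cast<char*>(&total), sizeof(total));

    const std::size_t chunk_elems = 1 << 20;   // arbitrary chunk size
    std::vector<int> chunk;

    for (std::size_t done = 0; done < total; ) {
        std::size_t n = std::min<std::size_t>(chunk_elems, total - done);
        chunk.resize(n);
        in.read(reinterpret_cast<char*>(chunk.data()), sizeof(int) * n);
        // ... process the n values in chunk here ...
        done += n;
    }
}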

Write compressed data to file using fwrite C++

I have a buffer of size 42131221 bytes (42.1MB) that I used to store some compressed data. Only the first 20MB actually store the compressed data and I am trying to write this to a file using fwrite:
fwrite (buffer , WHAT_GOES_HERE, buffer_length, pFile);
The second parameter requires the size of each element to write, but this isn't applicable in this case as I just want to write the whole buffer and since it's compressed, it isn't the case that there exists a size of each element.
Any idea on how I can get this to work?
WHAT_GOES_HERE should be sizeof(type of the buffer). Also, buffer_length should be the number of "types" you want to write to the file. I mention this since it seems you do not want to write the entire buffer but only the first 20MB.
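For a plain byte buffer that means an element size of 1 and a count equal to the number of bytes actually used; a sketch, with compressed_size standing in for your 20MB:

#include <cstddef>
#include <cstdio>

int main() {
    static unsigned char buffer[42131221];                  // the 42.1MB buffer from the question
    const std::size_t compressed_size = 20 * 1024 * 1024;   // assumed: bytes actually in use

    FILE* pFile = std::fopen("compressed.bin", "wb");
    if (!pFile) return 1;

    // Element size 1 byte, count = bytes to write, so only the used part is written.
    std::size_t written = std::fwrite(buffer, 1, compressed_size, pFile);
    std::fclose(pFile);
    return written == compressed_size ? 0 : 1;
}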
fwrite works on streams, which are buffered.
write is a lower-level API based on file descriptors. It doesn't know about buffering.
The rule of thumb is:
If you want to write a single large buffer, go for write.
Use fwrite if you want to write in smaller chunks.
So you could go for write here.
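A sketch of the write() route on POSIX (one call, no stdio buffering in between; compressed_size is again a placeholder, and a robust version would loop on short writes):

#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

int main() {
    static unsigned char buffer[42131221];                  // the 42.1MB buffer
    const std::size_t compressed_size = 20 * 1024 * 1024;   // assumed: bytes actually in use

    int fd = open("compressed.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    // write() takes a raw byte count, so no per-element size is needed at all.
    ssize_t written = write(fd, buffer, compressed_size);
    close(fd);
    return written == static_cast<ssize_t>(compressed_size) ? 0 : 1;
}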

How to read blocks of data from a file and then read from that block into a vector?

Suppose I have a file which has x records. One 'block' holds m records, so the total number of blocks in the file is n = x/m. If I know the size of one record, say b bytes (so the size of one block is b*m), I can read a complete block at once using the system call read() (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector?
The reason I want to do this in the first place is to reduce disk I/O operations, since disk I/O is much more expensive according to what I have learned.
Or will it take the same amount of time as when I read record by record from the file and put each one directly into the vector, instead of reading block by block? Reading block by block, I will have only n disk I/Os, whereas I would have x I/Os reading record by record.
Thanks.
You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that the file contents are simply mapped into your process's address space, as if you already had a pointer into them. By inspecting that memory and treating it as an array, or by copying data out of it with memcpy(), you implicitly perform the read operations, but only as necessary - the operating system's virtual memory subsystem is smart enough to do this very efficiently.
The only real reason to avoid mmap may be if you are running on a 32-bit OS and the file size exceeds 2 gigabytes (or slightly less than that). In this case the OS may have trouble allocating address space for your mmap-ed memory. On a 64-bit OS, using mmap should never be a problem.
Also, mmap can be cumbersome if you are writing a lot of data and the size of the data is not known upfront. Other than that, it is always better and faster to use it over read().
Actually, most modern operating systems rely on mmap extensively. For example, in Linux, to execute a binary, your executable is simply mmap-ed and executed from memory, as if it had been copied there by read, without actually reading it.
Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible that reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same thing by simply adjusting the size of the buffer used by the stream (which is probably a lot easier).
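A rough sketch of the block-at-a-time version with iostreams (Record and the m records per block are placeholders for whatever your format actually is); alternatively, just enlarge the stream's buffer with rdbuf()->pubsetbuf() before opening the file and keep reading record by record:

#include <cstddef>
#include <fstream>
#include <vector>

struct Record {                        // hypothetical fixed-size record of b bytes
    char payload[64];
};

int main() {
    const std::size_t m = 1024;        // records per block
    std::ifstream in("records.bin", std::ifstream::binary);

    std::vector<Record> block(m);
    std::vector<Record> all;

    // One read() call per block of b*m bytes; the final block may be short.
    while (in.read(reinterpret_cast<char*>(block.data()), sizeof(Record) * m) ||
           in.gcount() > 0) {
        std::size_t got = in.gcount() / sizeof(Record);
        all.insert(all.end(), block.begin(), block.begin() + got);
        if (got < m) break;            // reached the final, partial block
    }
}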

Reading text files

I'm trying to find out what is the best way to read large text (at least 5 mb) files in C++, considering speed and efficiency. Any preferred class or function to use and why?
By the way, I'm running specifically in a UNIX environment.
The stream classes (ifstream) actually do a good job; assuming you're not restricted otherwise, make sure to turn off stdio synchronisation (std::ios_base::sync_with_stdio(false)). You can use getline() to read directly into std::string, though from a performance perspective using a fixed buffer as a char* (a vector of chars or old-school char[]) may be faster (at higher risk/complexity).
You can go the mmap route if you're willing to play games with page size calculations and the like. I'd probably build it out first using the stream classes and see if it's good enough.
Depending on what you're doing with each line of data, you might start finding your processing routines are the optimization point and not the I/O.
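A minimal version of that ifstream approach (the file name is a placeholder, and the per-line processing is left as a comment):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ios_base::sync_with_stdio(false);   // detach iostreams from C stdio

    std::ifstream in("big.txt");             // hypothetical 5+ MB text file
    if (!in) return 1;

    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        ++count;                             // ... process each line here ...
    }
    std::cout << count << " lines\n";
}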
Use old-style file I/O (a sketch of these steps follows the list):
fopen the file for binary read
fseek to the end of the file
ftell to find out how many bytes are in the file.
malloc a chunk of memory to hold all of the bytes + 1
set the extra byte at the end of the buffer to NUL.
fread the entire file into memory.
create a vector of const char *
push_back the address of the first byte into the vector.
repeatedly
strstr - search the memory block for the carriage control character(s).
put a NUL at the found position
move past the carriage control characters
push_back that address into the vector
until all of the text in the buffer has been processed.
----------------
use the vector to find the strings,
and process as needed.
when done, delete the memory block
and the vector should self-destruct.
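A sketch of those steps (error handling omitted; it assumes '\n' line endings and that the file contains no embedded NUL bytes):

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

int main() {
    // fopen for binary read, fseek/ftell to find the size.
    FILE* f = std::fopen("big.txt", "rb");
    if (!f) return 1;
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);

    // malloc size + 1 bytes, NUL-terminate, and fread the whole file.
    char* block = static_cast<char*>(std::malloc(size + 1));
    block[size] = '\0';
    std::fread(block, 1, size, f);
    std::fclose(f);

    // Split in place: remember each line's start, overwrite each '\n' with NUL.
    std::vector<const char*> lines;
    lines.push_back(block);
    char* p = block;
    while ((p = std::strstr(p, "\n")) != nullptr) {
        *p++ = '\0';
        if (*p != '\0') lines.push_back(p);
    }

    // ... use the vector to find the strings and process as needed ...

    std::free(block);                // the vector cleans itself up
}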
If you are using a text file storing integers, floats and small strings, my experience is that FILE, fopen and fscanf are already fast enough, and you also get the numbers directly. I think memory mapping is the fastest, but it requires you to write code to parse the file, which is extra work.
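For example, pulling whitespace-separated numbers straight out of a text file with fopen/fscanf (the file name is a placeholder):

#include <cstdio>
#include <vector>

int main() {
    FILE* f = std::fopen("numbers.txt", "r");   // hypothetical text file of numbers
    if (!f) return 1;

    std::vector<double> values;
    double v;
    while (std::fscanf(f, "%lf", &v) == 1)      // fscanf skips whitespace and newlines itself
        values.push_back(v);

    std::fclose(f);
    std::printf("read %zu values\n", values.size());
}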

Which is faster, writing raw data to a drive, or writing to a file?

I need to write data into drive. I have two options:
Write raw sectors: _write(handle, pBuffer, size);
Write into a file: fwrite(pBuffer, size, count, pFile);
Which way is faster?
I expected the raw sector writing function, _write, to be more efficient. However, my test showed the opposite: fwrite is faster and _write takes longer.
I've pasted my snippet below; maybe my code is wrong. Can you help me out? Either way is okay by me, but I think the raw write is better, because it seems the data on the drive is at least obscured....
#define SSD_SECTOR_SIZE 512

// Note: szMemZero is a zero-filled buffer of SSD_SECTOR_SIZE bytes, and
// TIMER_START/TIMER_END/TIMER_PRINT are my timing macros.
unsigned long ulMovePointer = 0;

// Raw device write, one sector per call.
int g_pSddDevHandle = _open("\\\\.\\G:", _O_RDWR | _O_BINARY, _S_IREAD | _S_IWRITE);
TIMER_START();
while (ulMovePointer < 1024 * 1024 * 1024)
{
    _write(g_pSddDevHandle, szMemZero, SSD_SECTOR_SIZE);
    ulMovePointer += SSD_SECTOR_SIZE;
}
TIMER_END();
TIMER_PRINT();

// Buffered file write, one sector-sized chunk per call.
FILE * file = fopen("f:\\test.tmp", "a+");
ulMovePointer = 0;   // reset, otherwise the second loop never runs
TIMER_START();
while (ulMovePointer < 1024 * 1024 * 1024)
{
    fwrite(szMemZero, SSD_SECTOR_SIZE, 1, file);
    ulMovePointer += SSD_SECTOR_SIZE;
}
TIMER_END();
TIMER_PRINT();
Probably because a direct write isn't buffered. When you call fwrite, you are doing buffered writes which tend to be faster in most situations. Essentially, each FILE* handler has an internal buffer which is flushed to disk periodically when it becomes full, which means you end up making less system calls, as you only write to disk in larger chunks.
To put it another way, in your first loop, you are actually writing SSD_SECTOR_SIZE bytes to disk during each iteration. In your second loop you are not. You are only writing SSD_SECTOR_SIZE bytes to a memory buffer, which, depending on the size of the buffer, will only be flushed every Nth iteration.
In the _write() case, the value of SSD_SECTOR_SIZE matters. In the fwrite case, the size of each write will actually be BUFSIZ. To get a better comparison, make sure the underlying buffer sizes are the same.
However, this is probably only part of the difference.
In the fwrite case, you are measuring how fast you can get data into memory. You haven't flushed the stdio buffer to the operating system, and you haven't asked the operating system to flush its buffers to physical storage. To compare more accurately, you should call fflush() before stopping the timers.
If you actually care about getting data onto the disk rather than just getting the data into the operating systems buffers, you should ensure that you call fsync()/FlushFileBuffers() before stopping the timer.
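A sketch of what a fairer timing of the fwrite side could look like on Windows (the CRT's _commit() plays the role of fsync(); the timer macros from the question are left out):

#include <cstdio>
#include <io.h>       // _commit, _fileno (MSVC CRT)

int main() {
    char szMemZero[512] = {0};
    FILE* file = std::fopen("f:\\test.tmp", "wb");
    if (!file) return 1;

    // ... start the timer here ...
    for (unsigned long written = 0; written < 1024UL * 1024 * 1024; written += sizeof(szMemZero))
        std::fwrite(szMemZero, sizeof(szMemZero), 1, file);

    std::fflush(file);            // drain the stdio buffer into the operating system
    _commit(_fileno(file));       // ask the OS to write its buffers to the device
                                  // (on POSIX this would be fsync(fileno(file)))
    // ... stop the timer only after both calls above ...

    std::fclose(file);
}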
Other obvious differences:
The drives are different. I don't know how different.
The semantics of a write to a device are different from the semantics of writes to a filesystem; the file system is allowed to delay writes to improve performance until explicitly told not to (e.g. with a standard handle, a call to FlushFileBuffers()); writes directly to a device aren't necessarily optimised in that way. On the other hand, the file system must do extra I/O to manage metadata (block allocation, directory entries, etc.).
I suspect that you're seeing a difference in policy about how fast things actually get onto the disk. Raw disk performance can be very fast, but you need big writes and preferably multiple concurrent outstanding operations. You can also avoid buffer copying by using the right options when you open the handle.
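As a hedged illustration of "the right options when you open the handle" on Windows: unbuffered, write-through I/O bypasses the file cache, but it requires sector-aligned buffers, sizes and offsets (the 1 MiB write size and the drive path below are placeholders):

#include <windows.h>
#include <malloc.h>    // _aligned_malloc / _aligned_free
#include <cstring>

int main() {
    const DWORD chunk = 1 << 20;                       // large, sector-aligned writes
    void* buf = _aligned_malloc(chunk, 4096);          // alignment must cover the sector size
    std::memset(buf, 0, chunk);

    HANDLE h = CreateFileA("\\\\.\\G:", GENERIC_WRITE, 0, nullptr, OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, nullptr);
    if (h == INVALID_HANDLE_VALUE) { _aligned_free(buf); return 1; }

    DWORD written = 0;
    for (ULONGLONG total = 0; total < 1024ULL * 1024 * 1024; total += chunk)
        WriteFile(h, buf, chunk, &written, nullptr);   // no OS file cache in the way

    CloseHandle(h);
    _aligned_free(buf);
}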