How to write a large binary file to a disk - c++

I am writing a program which requires writing a large binary file (about 12 GiB or more) to a disk. I have created a small test program to test this functionality. Although allocating the RAM for the buffer is not a problem, my program does not write the data to the file. The file remains empty, even for a 3.72 GiB file.
//size_t bufferSize=1000;      //ok
//size_t bufferSize=100000000; //ok
size_t bufferSize=500000000;   //fails although it is under 4 GiB, which shouldn't cause problems anyway
double mem=double(bufferSize)*double(sizeof(double))/std::pow(1024.,3.);
cout<<"Total memory used: "<<mem<<" GiB"<<endl;
double *buffer=new double[bufferSize];
/* //enable if you want to fill the buffer with random data
printf("\r[%i %%]",0);
for (size_t i=0;i<bufferSize;i++)
{
    if ((i+1)%100==0) printf("\r[%i %%]",(int)(100.*double(i+1)/bufferSize));
    buffer[i]=rand() % 100;
}
*/
cout<<endl;
std::ofstream outfile("largeStuff.bin",std::ofstream::binary);
outfile.write((char*)buffer, bufferSize*sizeof(double));
outfile.close();
delete[] buffer;

I actually compiled and ran the code exactly as you have pasted there and it works. It creates a 4GB file.
If you are on a FAT32 filesystem the max filesize is 4GB.
Otherwise I suggest you check:
The amount of free disk space you have.
Whether your user account has any disk usage limits in place.
The amount of free RAM you have.
Whether there are any runtime errors (a minimal check of the stream state is sketched below).
@enhzflep's suggestion about the number of prints (although that is commented out).
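For what it's worth, the last point is easy to rule out: here is a minimal sketch of checking the stream state after the open and the write (it assumes the same file name as above; errno is not guaranteed to be set by iostreams, but it often hints at the OS-level cause):
#include <cerrno>
#include <cstring>
#include <fstream>
#include <iostream>
int main()
{
    const std::size_t bufferSize = 500000000;
    double *buffer = new double[bufferSize];
    std::ofstream outfile("largeStuff.bin", std::ofstream::binary);
    if (!outfile)
    {
        std::cerr << "open failed: " << std::strerror(errno) << std::endl;
        return 1;
    }
    outfile.write(reinterpret_cast<const char*>(buffer), bufferSize * sizeof(double));
    if (!outfile)
    {
        std::cerr << "write failed: " << std::strerror(errno) << std::endl;
        return 1;
    }
    outfile.close();
    delete[] buffer;
    return 0;
}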

It seems that you want to have a buffer that contains the whole file's contents prior to writing it.
You're doing it wrong, though: the virtual memory requirements are essentially double what they need to be. Your process retains the buffer, but when you write that buffer to disk it gets duplicated in the operating system's buffers. Now, most OSes will notice that you write sequentially and may discard their buffers quickly, but still: it's rather wasteful.
Instead, you should create an empty file, grow it to its desired size, then map its view into memory, and do the modifications on the file's view in memory. For 32 bit hosts your file size is limited to <1GB. For 64 bit hosts, it's limited by the filesystem only. On modern hardware, creating and filling a 1GB file that way takes on the order of 1 second (!) if you have enough free RAM available.
Thanks to the wonders of RAII, you don't need to do anything special to release the mapped memory, or to close/finalize the file. By leveraging boost you can avoid writing platform-specific code, too.
// https://github.com/KubaO/stackoverflown/tree/master/questions/mmap-boost-40308164
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/filesystem.hpp>
#include <cassert>
#include <cstdint>
#include <fstream>
namespace bip = boost::interprocess;
void fill(const char * fileName, size_t size) {
using element_type = uint64_t;
assert(size % sizeof(element_type) == 0);
std::ofstream().open(fileName); // create an empty file
boost::filesystem::resize_file(fileName, size);
auto mapping = bip::file_mapping{fileName, bip::read_write};
auto mapped_rgn = bip::mapped_region{mapping, bip::read_write};
const auto mmaped_data = static_cast<element_type*>(mapped_rgn.get_address());
const auto mmap_bytes = mapped_rgn.get_size();
const auto mmap_size = mmap_bytes / sizeof(*mmaped_data);
assert(mmap_bytes == size);
element_type n = 0;
for (auto p = mmaped_data; p < mmaped_data+mmap_size; ++p)
*p = n++;
}
int main() {
const uint64_t G = 1024ULL*1024ULL*1024ULL;
fill("tmp.bin", 1*G);
}

Related

fstream.write() gets stuck for a long time?

fstream.write() gets stuck while another thread removes an old directory. It seems that my HDD is busy writing and removing at the same time. How can I avoid this stall? Is there any timeout parameter for the write() function? Below is my procedure code:
#include <fstream>
#include <iostream>
#include <windows.h>   // Sleep, INT64
#include <process.h>   // _beginthreadex
UINT32 __stdcall remove_old_data(void* _pArguments);
int main()
{
    std::ios_base::sync_with_stdio(false);
    std::fstream myfile = std::fstream("sample.txt", std::ios::out | std::ios::binary);
    // Start thread to remove old data
    INT64* paramsInput = new INT64[2];
    const char* dir = "D:\\";
    paramsInput[0] = (INT64)dir;
    paramsInput[1] = 50; // GB
    _beginthreadex(NULL, 0, &remove_old_data, (void*)paramsInput, 0, NULL);
    int size = 0;
    char* data = NULL;
    while (true)
    {
        data = NULL;
        size = getData(data); // data is available every 10 ms
        if (size > 0 && data != NULL) // size ~= 30 KB
        {
            myfile.write(data, size); // write data to file
        }
    }
}
UINT32 __stdcall remove_old_data(void* _pArguments)
{
    INT64* params = (INT64*)_pArguments;
    char* dir = (char*)params[0];
    int freeSpaceThreshold = (int)params[1];
    delete[] params;
    while (true)
    {
        int curFreeSpace = GetFreeSpace(dir);
        if (curFreeSpace < freeSpaceThreshold)
        {
            // remove old files and directories here
            ClearData(dir); // file size is about 10 MB, 40,000 files in dir
        }
        Sleep(10000);
    }
}
It's a bit tough to say exactly what the bottleneck is in your case, although there could be several. Deleting files can be time consuming, particularly when doing several in a directory with many files. Writes to a mostly full drive are also slower, as it takes the OS longer to find empty space to store new data, and those chunks are smaller.
Here are some suggestions to improve the performance, in no particular order:
1) Use an SSD. This eliminates almost all the latency for hard disk access.
2) Use OS API functions for file access, and use unbuffered writes. This will avoid filling the disk cache with data that is only written and not read again, allowing the directory information to stay cached (a sketch follows this list).
3) Use multiple subdirectories to store your data. File access in a directory can slow down if the directory size gets to be too large.
4) Cache the 30K data chunks locally, keeping multiple chunks queued up for writing, and only write them out when remove_old_data is not cleaning up the directory.
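To illustrate suggestion 2, here is a minimal sketch of an unbuffered, write-through write using the Windows API (the file name, chunk size and the assumed 4096-byte alignment are placeholders; with FILE_FLAG_NO_BUFFERING the buffer address, write size and file offset must all be multiples of the sector size, so the data is staged in an aligned buffer):
#include <windows.h>
#include <malloc.h>   // _aligned_malloc / _aligned_free
#include <cstring>
int main()
{
    const DWORD chunkSize = 32 * 1024;  // must be a multiple of the sector size
    // Sector-aligned staging buffer, required by FILE_FLAG_NO_BUFFERING
    char* buffer = (char*)_aligned_malloc(chunkSize, 4096);
    std::memset(buffer, 0, chunkSize);
    HANDLE file = CreateFileA("sample.bin", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return 1;
    // Bypasses the OS file cache, so write-once data does not evict cached directory metadata
    DWORD written = 0;
    WriteFile(file, buffer, chunkSize, &written, NULL);
    CloseHandle(file);
    _aligned_free(buffer);
    return 0;
}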

Memory page alignment doesn't seem to affect performance...?

I am developing an application for which performance is a fundamental issue. In particular, I want to organize a tree-like structure that needs to be traversed really quickly, in blocks of the same size as my memory page size, so as to reduce the number of cache misses needed to reach a leaf.
I am quite a novice in the art of memory optimization. As far as I understand, the process of accessing the main memory goes more or less as follows:
CPUs have several layers of caches of increasing size and decreasing speed.
Every time some data that I need is already in the cache, it is fetched from the cache (cache hit).
If it is not in the cache, it will be fetched from the main memory.
Anytime something is loaded from the main memory, the whole page (or pages) containing the data are loaded and stored in the cache. In this way, if I try to access locations in memory that are close to the ones I already fetched from the main memory, they will already be in my CPU cache.
However, if I organize my data in blocks of the same size as my memory page size, I thought it would also be necessary to align that data properly, so that whenever a new block of my data needs to be loaded, only one page of memory would have to be fetched from the main memory (rather than the two pages containing the first half and the second half of my data block). In principle, shouldn't a correctly aligned data block mean only one access to memory rather than two? Shouldn't that more or less double memory performance?
I tried the following:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>
using namespace std;
#define BLOCKS 262144
#define TESTS 131072
unsigned long int utime()
{
struct timeval tp;
gettimeofday(&tp, NULL);
return tp.tv_sec * 1000000 + tp.tv_usec;
}
unsigned long int pagesize = sysconf(_SC_PAGE_SIZE);
unsigned long int block_slots = pagesize / sizeof(unsigned int);
unsigned int t = 0;
unsigned int p = 0;
unsigned long int test(unsigned int * data)
{
unsigned long int start = utime();
for(unsigned int n=0; n<TESTS; n++)
{
for(unsigned int i=0; i<block_slots; i++)
t += data[p * block_slots + i];
p = t % BLOCKS;
}
unsigned long int end = utime();
return end - start;
}
int main()
{
srand((unsigned int) time(NULL));
char * buffer = new char[(BLOCKS + 3) * pagesize];
for(unsigned int i=0; i<(BLOCKS + 3) * pagesize; i++)
buffer[i] = rand();
for(unsigned int i=0; i<pagesize; i++)
cout<<test((unsigned int *) (buffer + i))<<endl;
cout<<"("<<t<<")"<<endl;
delete [] buffer;
}
This code allocates roughly 1 GiB, fills it with random bytes, and then calls the function test with every possible shift within a memory page (from a shift of 0 up to the page size). The test function interprets the pointer provided as a group of blocks of data and carries out some simple operation (a sum) over those blocks. The order of access to the blocks is more or less random (it's determined by the partial sums), so that every time a new block is accessed it is nearly certain not to be in the cache already.
The function test is then timed. I expected all the shift configurations but one to show similar timings, while one particular configuration (the zero shift, maybe?) should show a big improvement in efficiency. This, however, does not happen at all: all the shift timings are perfectly compatible with each other.
Why does this happen and what does this mean? Can I just forget about memory alignment? Can I also forget about making my data blocks exactly as big as a memory page? (I was planning to use some padding in case they were smaller). Or maybe something in the cache management process is just unclear to me?
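As an aside, blocks can be made page-aligned directly instead of probing shifts by hand; a minimal sketch using posix_memalign (assuming a POSIX system; the block count is a placeholder):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main()
{
    size_t pagesize = (size_t) sysconf(_SC_PAGE_SIZE);
    size_t blocks = 1024; // placeholder block count
    void* p = NULL;
    // posix_memalign guarantees the returned address is a multiple of pagesize
    if (posix_memalign(&p, pagesize, blocks * pagesize) != 0)
        return 1;
    printf("aligned block at %p (page size %zu)\n", p, pagesize);
    free(p);
    return 0;
}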

Why does allocating large chunks of memory fail when reallocing small chunks doesn't

This code results in x pointing to a chunk of memory 100GB in size.
#include <stdlib.h>
#include <stdio.h>
int main() {
auto x = malloc(1);
for (int i = 1; i< 1024; ++i) x = realloc(x, i*1024ULL*1024*100);
while (true); // Give us time to check top
}
While this code fails allocation.
#include <stdlib.h>
#include <stdio.h>
int main() {
auto x = malloc(1024ULL*1024*100*1024);
printf("%p\n", x);
while (true); // Give us time to check top
}
My guess is that the memory size of your system is less than the 100 GiB that you are trying to allocate. While Linux does overcommit memory, it still bails out of requests that are way beyond what it can fulfill. That is why the second example fails.
The many small increments of the first example, on the other hand, are way below that threshold. So each one of them succeeds as the kernel knows that you didn't require any of the prior memory yet, so it has no indication that it won't be able to back those 100 additional MiB.
I believe that the threshold for when a memory request from a process fails is relative to the available RAM and swap, and that it can be adjusted (on Linux via the vm.overcommit_memory and vm.overcommit_ratio sysctls).
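For reference, a minimal sketch that reads the current policy from /proc on a Linux system (0 is the default heuristic, 1 always allows, 2 enforces a strict commit limit derived from vm.overcommit_ratio):
#include <fstream>
#include <iostream>
#include <string>
int main()
{
    // 0 = heuristic overcommit (default), 1 = always allow, 2 = strict commit limit
    std::ifstream policy("/proc/sys/vm/overcommit_memory");
    std::string value;
    if (policy && std::getline(policy, value))
        std::cout << "vm.overcommit_memory = " << value << std::endl;
    else
        std::cout << "could not read the overcommit policy" << std::endl;
    return 0;
}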
Well you're allocating less memory in the one that succeeds:
for (int i = 1; i< 1024; ++i) x = realloc(x, i*1024ULL*1024*100);
The last realloc is:
x = realloc(x, 1023 * (1024ULL*1024*100));
As compared to:
auto x = malloc(1024 * (1024ULL*100*1024));
Maybe that's right where your memory boundary is - the last 100M that broke the camel's back?

Loop Around File Mapping Kills Performance

I have a circular buffer which is backed with file mapped memory (the buffer is in the size range of 8GB-512GB).
I am writing to (8 instances of) this memory in a sequential manner from the beginning to the end at which point it loops around back to the beginning.
It works fine until it reaches the end where it needs to perform two file mappings and loop around the memory, at which point IO performance is totally trashed and doesn't recover (even after several minutes). I can't quite figure it out.
using namespace boost::interprocess;
class mapping
{
public:
mapping()
{
}
mapping(file_mapping& file, mode_t mode, std::size_t file_size, std::size_t offset, std::size_t size)
: offset_(offset)
, mode_(mode)
{
const auto aligned_size = page_ceil(size + page_size());
const auto aligned_file_size = page_floor(file_size);
const auto aligned_file_offset = page_floor(offset % aligned_file_size);
const auto region1_size = std::min(aligned_size, aligned_file_size - aligned_file_offset);
const auto region2_size = aligned_size - region1_size;
if (region2_size)
{
const auto region1_address = mapped_region(file, read_only, 0, (region1_size + region2_size) * 2).get_address();
const auto region2_address = reinterpret_cast<char*>(region1_address) + region1_size;
region1_ = mapped_region(file, mode, aligned_file_offset, region1_size, region1_address);
region2_ = mapped_region(file, mode, 0, region2_size, region2_address);
}
else
{
region1_ = mapped_region(file, mode, aligned_file_offset, region1_size);
region2_ = mapped_region();
}
size_ = region1_.get_size() + region2_.get_size();
offset_ = aligned_file_offset;
}
auto offset() const -> std::size_t { return offset_; }
auto size() const -> std::size_t { return size_; }
auto data() const -> const void* { return region1_.get_address(); }
auto data() -> void* { return region1_.get_address(); }
auto flush(bool async = true) -> void
{
region1_.flush(async);
region2_.flush(async);
}
auto mode() const -> mode_t { return mode_; }
private:
std::size_t offset_ = 0;
std::size_t size_ = 0;
mode_t mode_;
mapped_region region1_;
mapped_region region2_;
};
struct loop_mapping::impl final
{
std::tr2::sys::path file_path_;
file_mapping file_mapping_;
std::size_t file_size_;
std::size_t map_size_ = page_floor(256000000ULL);
std::shared_ptr<mapping> mapping_ = std::shared_ptr<mapping>(new mapping());
std::shared_ptr<mapping> prev_mapping_;
bool write_;
public:
impl(std::tr2::sys::path path, bool write)
: file_path_(std::move(path))
, file_mapping_(file_path_.string().c_str(), write ? read_write : read_only)
, file_size_(page_floor(std::tr2::sys::file_size(file_path_)))
, write_(write)
{
REQUIRE(file_size_ >= map_size_ * 3);
}
~impl()
{
prev_mapping_.reset();
mapping_.reset();
}
auto data(std::size_t offset, std::size_t size, boost::optional<bool> write_opt) -> void*
{
offset = offset % page_floor(file_size_);
REQUIRE(size < file_size_ - map_size_ * 3);
const auto write = write_opt.get_value_or(write_);
REQUIRE(!write || write_);
if ((write && mapping_->mode() == read_only) || offset < mapping_->offset() || offset + size >= mapping_->offset() + mapping_->size())
{
auto new_mapping = std::make_shared<loop::mapping>(file_mapping_, write ? read_write : read_only, file_size_, page_floor(offset), std::max(size + page_size(), map_size_));
if (mapping_)
mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));
if (prev_mapping_)
prev_mapping_->flush(false);
prev_mapping_ = std::move(mapping_);
mapping_ = std::move(new_mapping);
}
return reinterpret_cast<char*>(mapping_->data()) + offset - mapping_->offset();
}
};
// 8 processes to 8 different files 128GB each.
loop_mapping loop(...);
for (auto n = 0; true; ++n)
{
auto src = get_new_data(5000000/8);
auto dst = loop.data(n * 5000000/8, 5000000/8, true);
std::memcpy(dst, src, 5000000/8); // This becomes very slow after loop around.
std::this_thread::sleep_for(std::chrono::seconds(1));
}
Any ideas?
Target System:
1x 3TB Seagate Constellation ES.3
2x Xeon E5-2400 (6 core, 2.6Ghz)
6x 8GB DDR3 1600Mhz ECC
Windows Server 2012
8 buffers, each 8 to 512 GiB in size, on a system with 48 GiB of physical memory mean that your mappings will have to be swapped. No surprise there.
The issue, as you have already remarked yourself, is that prior to being able to write to a page, you encounter a fault, and the page is read in. That doesn't happen on the first run, since merely a zero page is used. To make matters worse, reading in pages again competes with write-behind of dirty pages.
Now, there is unfortunately no way of telling Windows "I'm going to overwrite this anyway", nor is there any way of making the disk load your stuff faster. However, you can start the transfer earlier (maybe when you're 3/4 through the buffer).
Windows Server 2012 (which you're using) supports PrefetchVirtualMemory which is a somewhat half-assed substitute for POSIX madvise(MADV_WILLNEED).
That is, of course, not exactly what you want to do when you already know that you will overwrite the complete memory page (or several of them) anyway, but it is as good as you can get. It's worth a try in any case.
Ideally, you would want to do something like a destructive madvise(MADV_DONTNEED) as implemented e.g. under Linux (and I believe FreeBSD, too) immediately before you overwrite a page, but I am not aware of any way of doing this under Windows (...short of destroying the view and the mapping and mapping from scratch, but then you throw away all data, so that's a bit useless).
Even with prefetching early you will still be limited by disk I/O bandwidth, but at least you can hide the latency.
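For illustration, a minimal sketch of such a prefetch with PrefetchVirtualMemory (assuming Windows 8 / Server 2012 or later; view and chunk_bytes are placeholders for the mapped region and how far ahead to read):
#include <windows.h>
// Hint the memory manager to start paging in [view, view + chunk_bytes) ahead of use.
void prefetch_chunk(void* view, SIZE_T chunk_bytes)
{
    WIN32_MEMORY_RANGE_ENTRY range;
    range.VirtualAddress = view;
    range.NumberOfBytes  = chunk_bytes;
    // Non-blocking hint; failure just means the pages will be faulted in on demand.
    PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);
}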
Another "obvious" (but probably not that easy) solution would be to make the consumer faster. That would allow for a smaller buffer to begin with, and even on a huge buffer it would keep the working set smaller (both producer and consumer force pages into RAM while accessing them, so if the consumer accesses data with less delay after the producer has written them, they will both be using mostly the same set of pages.) Smaller working sets fit into RAM more easily.
But I realize that you probably didn't choose a several-gigabyte buffer for no reason.
Since your code is devoid of any comments, filled with auto variables, not compilable as is, and I don't have 512 GB available on my PC to test it anyway, this will remain a passing thought off the top of my head.
Each of your processes only writes a few hundred KB/s, so there should be ample time to flush that to disk in the background.
However, it seems you are asking the boost mapping system to flush either synchronously or asynchronously the previous chunk depending on your mysterious offset computations:
mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));
I guess the rollover triggers a synchronous flush, which is a likely culprit for the sudden slowdown.
What the operating system does at this point depends on the boost implementation, which is not described (or at least in a way obvious enough for me to get it after a cursory look at their man page).
If boost stuffed your 48 GB of memory with unflushed pages, you could certainly experience a sudden and prolonged slowdown.
At least worth a comment in your code if this mysterious line does something clever and completely different I missed entirely.
If you are able to back the memory mapping with the page file rather than a specific file, you can use the MEM_RESET flag with VirtualAlloc to prevent Windows from paging in the old contents.
The main issue I would anticipate in using this approach is that you can't easily recover the disk space when you are done. It may also require the system's page file settings to be changed; I believe it will work with the default settings, but not if a maximum page file size has been set.
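A minimal sketch of that idea, following this answer's suggestion (address and bytes are placeholders for a committed, pagefile-backed range that is about to be completely overwritten; whether this helps depends on how the mapping was created):
#include <windows.h>
// Tell the memory manager that the current contents of [address, address + bytes)
// are no longer of interest, so they need not be read back or written to the page file.
void discard_before_overwrite(void* address, SIZE_T bytes)
{
    // With MEM_RESET the protection argument is ignored but must still be a valid value.
    VirtualAlloc(address, bytes, MEM_RESET, PAGE_NOACCESS);
}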
I am going to assume that by "Loop around" you mean that the RAM got full.
What happens is that until the RAM gets full, all you have to do is allocate a page and write to it (RAM speed); after the RAM gets full, every page allocation turns into two actions:
1. you have to write a dirty page back (DISK speed)
2. and allocate a page (RAM speed)
And in the worst case you also have to bring the page in from the file on disk (DISK speed) if you are reading something from it.
So instead of working only at RAM speed (page allocation), every page allocation runs at DISK speed.
This doesn't happen with 2x8GB because that is small enough for all of the memory of both files to remain fully in RAM.
The problem here, it turns out, is that when overwriting a valid page in memory, the page first has to be read from the drive before being overwritten. There is no way to get around this issue, as far as I know, when using memory mapped files.
The reason it doesn't happen during the first pass is that the pages being overwritten are not "valid" and thus they do not need to be read back.

Why this app doesn't consume as much memory as expected

I wrote a simple application to test memory consumption. In this test application, I created four processes to continually consume memory; those processes won't release the memory unless the process exits.
I expected this test application to consume most of the RAM and cause other applications to slow down or crash. But the result is not what I expected. Below is the code:
#include <stdio.h>
#include <unistd.h>
#include <list>
#include <vector>
using namespace std;
unsigned short calcrc(unsigned char *ptr, int count)
{
unsigned short crc;
unsigned char i;
//high cpu-consumption code
//implements the CRC algorithm
//CRC is Cyclic Redundancy Code
}
void* ForkChild(void* param){
vector<unsigned char*> MemoryVector;
pid_t PID = fork();
if (PID > 0){
const int TEN_MEGA = 10 * 10 * 1024 * 1024;
unsigned char* buffer = NULL;
while(1){
buffer = NULL;
buffer = new unsigned char [TEN_MEGA];
if (buffer){
try{
calcrc(buffer, TEN_MEGA);
MemoryVector.push_back(buffer);
} catch(...){
printf("An error was throwed, but caught by our app!\n");
delete [] buffer;
buffer = NULL;
}
}
else{
printf("no memory to allocate!\n");
try{
if (MemoryVector.size()){
buffer = MemoryVector[0];
calcrc(buffer, TEN_MEGA);
buffer = NULL;
} else {
printf("no memory ever allocated for this Process!\n");
continue;
}
} catch(...){
printf("An error was throwed -- branch 2,"
"but caught by our app!\n");
buffer = NULL;
}
}
} //while(1)
} else if (PID == 0){
} else {
perror("fork error");
}
return NULL;
}
int main(){
int children = 4;
while(--children >= 0){
ForkChild(NULL);
};
while(1) sleep(1);
printf("exiting main process\n");
return 0;
}
TOP command
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2775 steve 20 0 1503m 508 312 R 99.5 0.0 1:00.46 test
2777 steve 20 0 1503m 508 312 R 96.9 0.0 1:00.54 test
2774 steve 20 0 1503m 904 708 R 96.6 0.0 0:59.92 test
2776 steve 20 0 1503m 508 312 R 96.2 0.0 1:00.57 test
Though the CPU usage is high, the memory percentage remains 0.0. How can that be possible?
Free command
free shared buffers cached
Mem: 3083796 0 55996 428296
Free memory is more than 3G out of 4G RAM.
Does anybody know why this test app just doesn't work as expected?
Linux uses optimistic memory allocation: it will not physically allocate a page of memory until that page is actually written to. For that reason, you can allocate much more memory than what is available, without increasing memory consumption by the system.
If you want to force the system to allocate (commit) a physical page, then you have to write to it.
The following line does not issue any write, as it is default-initialization of unsigned char, which is a no-op:
buffer = new unsigned char [TEN_MEGA];
If you want to force a commit, use zero-initialization:
buffer = new unsigned char [TEN_MEGA]();
To make the comments into an answer:
Linux will not allocate memory pages for a process until it writes to them (copy-on-write).
Additionally, you are not writing to your buffer anywhere, as the default constructor for unsigned char does not perform any initializations, and new[] default-initializes all items.
fork() returns the PID in the parent, and 0 in the child. Your ForkChild as written will execute all the work in the parent, not the child.
And the standard new operator will never return null; it will throw if it fails to allocate memory (but due to overcommit it won't actually do that either in Linux). This means your test of buffer after the allocation is meaningless: it will always either take the first branch or never reach the test. If you want a null return, you need to write new (std::nothrow) .... Include <new> for that to work.
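For illustration, a minimal sketch of the nothrow form mentioned above, which makes the null check meaningful (the allocation size is a placeholder):
#include <new>
#include <stdio.h>
int main()
{
    const size_t bytes = 10 * 1024 * 1024; // placeholder: 10 MB
    unsigned char* buffer = new (std::nothrow) unsigned char[bytes];
    if (buffer == NULL)
    {
        // allocation failed without throwing std::bad_alloc
        printf("no memory to allocate!\n");
        return 1;
    }
    delete[] buffer;
    return 0;
}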
But your program is in fact doing what you expected it to do. As another answer has pointed out (Michael Foukarakis's answer), memory that is not used is not allocated. In the output of the top program, I noticed that the VIRT column had a large amount of memory in it for each process running your program. A little googling later, I saw what this was:
VIRT -- Virtual Memory Size (KiB). The total amount of virtual memory used by the task. It includes all code, data and shared libraries plus pages that have been swapped out and pages that have been mapped but not used.
So as you can see, your program does in fact reserve memory for itself, but in the form of pages, accounted as virtual memory. And I think that is a smart thing to do.
A snippet from this wiki page
A page, memory page, or virtual page -- a fixed-length contiguous block of virtual memory, and it is the smallest unit of data for the following:
memory allocation performed by the operating system for a program; and
transfer between main memory and any other auxiliary store, such as a hard disk drive.
...Thus a program can address more (virtual) RAM than physically exists in the computer. Virtual memory is a scheme that gives users the illusion of working with a large block of contiguous memory space (perhaps even larger than real memory), when in actuality most of their work is on auxiliary storage (disk). Fixed-size blocks (pages) or variable-size blocks of the job are read into main memory as needed.
Sources:
http://www.computerhope.com/unix/top.htm
https://stackoverflow.com/a/18917909/2089675
http://en.wikipedia.org/wiki/Page_(computer_memory)
If you want to gobble up a lot of memory:
int mb = 0;           // counts how many megabytes have been touched
char* buffer;
while (1) {
    buffer = (char*) malloc(1024*1024);
    memset(buffer, 0, 1024*1024); // actually touch the pages so they get committed
    mb++;
}
I used something like this to make sure the file buffer cache was empty when taking some file I/O timing measurements.
As other answers have already mentioned, your code doesn't ever write to the buffer after allocating it. Here memset is used to write to the buffer.