Windows shared memory access time slow - C++

I am currently using shared memory with two mapped files (1.9 GB for the first one and 600 MB for the second) in a piece of software.
One process reads data from the first file, processes it, and writes the results to the second file.
I have noticed an occasional strong delay (the cause is unknown to me) when reading from or writing to the mapped views with memcpy.
The mapped files are created this way:
m_hFile = ::CreateFileW(SensorFileName,
                        GENERIC_READ | GENERIC_WRITE,
                        0,
                        NULL,
                        CREATE_ALWAYS,
                        FILE_ATTRIBUTE_NORMAL,
                        NULL);

m_hMappedFile = CreateFileMapping(m_hFile,
                                  NULL,
                                  PAGE_READWRITE,
                                  dwFileMapSizeHigh,
                                  dwFileMapSizeLow,
                                  NULL);
And the memory view is mapped this way:
m_lpMapView = MapViewOfFile(m_hMappedFile,
                            FILE_MAP_ALL_ACCESS,
                            dwOffsetHigh,
                            dwOffsetLow,
                            m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow values are aligned to ("matching") the allocation granularity from the system info.
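For reference, here is a minimal sketch of how such an aligned offset can be computed (the variable names and the example offset are illustrative, not taken from the original code):

#include <windows.h>

// Query the allocation granularity (typically 64 KiB) and round the
// desired file offset down to a multiple of it, as MapViewOfFile requires.
SYSTEM_INFO si;
GetSystemInfo(&si);
const unsigned long long granularity = si.dwAllocationGranularity;

unsigned long long desiredOffset = 123456789ULL;  // hypothetical target offset
unsigned long long alignedOffset = desiredOffset - (desiredOffset % granularity);

DWORD dwOffsetHigh = static_cast<DWORD>(alignedOffset >> 32);
DWORD dwOffsetLow  = static_cast<DWORD>(alignedOffset & 0xFFFFFFFFULL);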
The process reads about 300 KB, N times, stores it in a buffer, processes the buffer, and then writes the processed contents, again about 300 KB at a time, N times, to the second file.
I have two different memory views (created/moved with the MapViewOfFile function) with a default size of 10 MB each.
For the memory view size, I tested 10 KB, 100 KB, 1 MB, 10 MB and 100 MB: statistically no difference. 80% of the time the reading pass behaves as described below (~200 ms) but the writing pass is really slow.
Normally:
1/ Reading is done in ~200 ms.
2/ Processing is done in 2.9 seconds.
3/ Writing is done in ~200 ms.
I can see that 80% of the time, either reading or writing (in the worst case both are slow) will take between 2 and 10 seconds.
Example: for writing, I am using the code below.
for (unsigned int i = 0; i < N; i++)  // N = 500~3k
{
    // Check the position of the memory view for ponderation
    if (###)
        MoveView(iOffset);

    if (m_lpMapView)
    {
        memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
        // uiSize = ~300 kBytes
        memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
    }
    else
        return uiANNStatus;
}
After using the GetTickCount function to pinpoint where the delay is, I can see that the second memcpy call is consistently the one taking most of the time.
So far I am seeing N calls to memcpy (N = 500 in my tests) take up to 10 seconds in the worst case when using this shared memory.
I wrote a temporary program performing the same number of memcpy calls on the same amount of data, and could not reproduce the problem.
For my tests I used the following conditions; they all show the same delay:
1/ I can see this on various computers, 32-bit or 64-bit, from Windows 7 to Windows 10.
2/ Using the main thread or multiple threads (up to 8, with critical sections for synchronization) for reading/writing.
3/ OS on SATA or SSD; memory-mapped files physically on a SATA or SSD disk; and, when on an external disk, tested through USB 1, USB 2 or USB 3.
What might my mistake be for memcpy to go this slow?
Best regards.

I found a solution that works for me, though it might not be the case for others.
Following Thomas Matthews' comments, I checked MSDN and found two interesting functions, FlushViewOfFile and FlushFileBuffers (but couldn't find anything interesting about locking memory).
Calling both after the for loop forces an update of the mapped file.
I see no more "random" delay, but instead of the expected 200 ms I have an average of 400 ms, which is good enough for my application.
After some more tests I saw that calling them too often causes heavy hard-disk access and makes the delay worse (10 seconds for every for loop), so the flushes should be used carefully.
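For reference, a minimal sketch of that flush sequence, using the handles from the question (error handling is only hinted at):

// Flush the dirty pages of the mapped view to the on-disk file, then
// ask the OS to write its own buffers through to the device.
if (!FlushViewOfFile(m_lpMapView, 0))   // 0 = flush the entire view
    ;  // handle GetLastError()
if (!FlushFileBuffers(m_hFile))
    ;  // handle GetLastError()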
Thanks.


C++ _popen() windows leaks paged pool memory

The main application runs as a Windows service and starts other C++ console processes, all with hidden console windows; i.e. the parent process is a Windows service and the child processes are non-console applications.
We observed that the paged pool memory of the system increases during calls to _popen() on the customer's Windows Server 2016 system. The application runs clean on our lab system with the same OS.
With the Windows Performance tool xperf we captured logs and checked the call stack; a picture is attached for reference.
void CMachine::GetJavaVersion()
{
    m_stJavaVersion.m_strName = " Java version";
    CPUChar strVersion[64] = { 0 };
    BOOL bFound = CheckJREVersion(strVersion, 64);
    BYTE bytColorSt = RED;
    string strRemark;
    FILE *fp = NULL;
    char version[130] = { 0 };
    BOOL bFoundVersion = FALSE;
    fp = _popen("java -version 2>&1", "r");
    while (fp && fgets(version, sizeof version, fp))
    {
        string strTmp = version;
        if (strTmp.find("version") != string::npos)
        {
            bFoundVersion = TRUE;
            break;
        }
    }
    if (fp) _pclose(fp);
    ....
PoolMon trace
Memory:33401164K Avail:30057324K PageFlts: 92362 InRam Krnl:20212K P:776328K
Commit:3228052K Limit:37595468K Peak:4747992K Pool N:182820K P:782568K
System pool information
Tag  Type   Allocs            Frees             Diff    Bytes               Per Alloc
Toke Paged  10546816 ( 390)   10319712 ( 382)   227104  324868080 ( 11392)  1430
CM31 Paged     42886 (   0)      20849 (   0)    22037  101154816 (     0)  4590
SeAt Paged  44678436 (1662)   43769798 (1630)   908638   87253680 (  3072)    96
QINi Paged       234 (   0)          1 (   0)      233   60293216 (     0)  258769
MmSt Paged   2683066 (  79)    2670922 (  83)    12144   27223856 (  3312)  2241
Eric Lippert writes about benchmark mistakes. I think mistake #1 applies to your case:
Mistake #1: Choosing a bad metric.
Why do you measure "paged pool" to determine a memory leak?
Paged memory is the memory that is swapped out to disk. This happens because the physical RAM is needed for something else. What is the physical RAM needed for? Probably for running the process that you start.
Once the memory is swapped to disk, it may take a while until it is swapped back to RAM. That will happen just when some other application tries to access the memory - and that may be minutes, if ever.
I also tend to say that memory isn't leaked during a method call but after a method call. After the method call, all variables should be destroyed and the related resources should be released.
If you are told that the paged pool is the cause, then ask for proof.
On my Windows 10 system, the paged pool limit is 17 GB. This can be shown by Process Explorer in View/System Information with Symbols configured.
If you're running java -version so often that it leaks 17 GB of kernel memory, then something is seriously wrong. Of course there will be a pipe or something to redirect the output from Java to your application so you can read the stream. There will also be other kernel objects like a process, a thread etc.
Even with 1 kB of kernel memory leak for each call, you would need to call that 17 million times to exhaust the paged pool. If that's the case, maybe you should consider caching the result anyway. It should be unlikely that server admins install and uninstall Java 17 million times in a few days.
For monitoring the paged pool, you can try Poolmon with /p /P command line parameters. Poolmon is part of the WDK.
Problems in your code:
Your code has at least two problems:
1. If "version" never appears in the output, your code might run in an endless loop. How could that happen? It's unlikely, but if I rename my HelloWorld.exe to java.exe, it could.
2. If "version" does appear in the output but is accidentally split across reads, with "ver" at the end of one buffer and "sion" at the start of the next, you'll never find out it actually was there, and again your code could run into an endless loop.

Why is my C++ disk write test much slower than a simple file copy using bash?

Using the program below, I try to test how fast I can write to disk using std::ofstream.
I achieve around 300 MiB/s when writing a 1 GiB file.
However, a simple file copy using the cp command is at least twice as fast.
Is my program hitting the hardware limit or can it be made faster?
#include <chrono>
#include <iostream>
#include <fstream>

char payload[1000 * 1000]; // 1 MB

void test(int MB)
{
    // Configure buffer
    char buffer[32 * 1000];
    std::ofstream of("test.file");
    of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));

    auto start_time = std::chrono::steady_clock::now();

    // Write a total of 1 GB
    for (auto i = 0; i != MB; ++i)
    {
        of.write(payload, sizeof(payload));
    }

    double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
    double megabytes_per_ns = 1e3 / elapsed_ns;
    double megabytes_per_s = 1e9 * megabytes_per_ns;
    std::cout << "Payload=" << MB << "MB Speed=" << megabytes_per_s << "MB/s" << std::endl;
}

int main()
{
    for (auto i = 1; i <= 10; ++i)
    {
        test(i * 100);
    }
}
Output:
Payload=100MB Speed=3792.06MB/s
Payload=200MB Speed=1790.41MB/s
Payload=300MB Speed=1204.66MB/s
Payload=400MB Speed=910.37MB/s
Payload=500MB Speed=722.704MB/s
Payload=600MB Speed=579.914MB/s
Payload=700MB Speed=499.281MB/s
Payload=800MB Speed=462.131MB/s
Payload=900MB Speed=411.414MB/s
Payload=1000MB Speed=364.613MB/s
Update
I changed from std::ofstream to fwrite:
#include <chrono>
#include <cstdio>
#include <iostream>

char payload[1024 * 1024]; // 1 MiB

void test(int number_of_megabytes)
{
    FILE* file = fopen("test.file", "w");

    auto start_time = std::chrono::steady_clock::now();

    // Write a total of 1 GB
    for (auto i = 0; i != number_of_megabytes; ++i)
    {
        fwrite(payload, 1, sizeof(payload), file);
    }
    fclose(file); // TODO: RAII

    double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
    double megabytes_per_ns = 1e3 / elapsed_ns;
    double megabytes_per_s = 1e9 * megabytes_per_ns;
    std::cout << "Size=" << number_of_megabytes << "MiB Duration=" << long(0.5 + 100 * elapsed_ns / 1e9) / 100.0 << "s Speed=" << megabytes_per_s << "MiB/s" << std::endl;
}

int main()
{
    test(256);
    test(512);
    test(1024);
    test(1024);
}
This improves the speed to 668 MiB/s for a 1 GiB file:
Size=256MiB Duration=0.4s Speed=2524.66MiB/s
Size=512MiB Duration=0.79s Speed=1262.41MiB/s
Size=1024MiB Duration=1.5s Speed=664.521MiB/s
Size=1024MiB Duration=1.5s Speed=668.85MiB/s
Which is just as fast as dd:
time dd if=/dev/zero of=test.file bs=1024 count=0 seek=1048576
real 0m1.539s
user 0m0.001s
sys 0m0.344s
First, you're not really measuring the disk writing speed, but (partly) the speed of writing data to the OS disk cache. To really measure the disk writing speed, the data should be flushed to disk before calculating the time. Without flushing there could be a difference depending on the file size and the available memory.
There seems to be something wrong in the calculations too. You're not using the value of MB.
Also make sure the buffer size is a power of two, or at least a multiple of the disk page size (4096 bytes): char buffer[32 * 1024];. You might as well do that for payload too (it looks like you changed it from 1024 to 1000 in the edit where you added the calculations).
Do not use streams to write a (binary) buffer of data to disk, but instead write directly to the file, using FILE*, fopen(), fwrite(), fclose(). See this answer for an example and some timings.
To copy a file: open the source file in read-only and, if possible, forward-only mode, and use fread() and fwrite():

    while fread() from source into buffer
        fwrite() buffer to destination file
This should give you a speed comparable to the speed of an OS file copy (you might want to test some different buffer sizes).
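A minimal sketch of such a copy loop (the function name and the 1 MiB buffer size are illustrative assumptions):

#include <cstdio>
#include <vector>

// Copy src to dst through a large user-space buffer.
bool copy_file(const char *src, const char *dst)
{
    FILE *in  = fopen(src, "rb");
    FILE *out = in ? fopen(dst, "wb") : nullptr;
    if (!in || !out) { if (in) fclose(in); return false; }

    std::vector<char> buffer(1024 * 1024);  // 1 MiB, a power of two
    size_t n;
    while ((n = fread(buffer.data(), 1, buffer.size(), in)) > 0)
        fwrite(buffer.data(), 1, n, out);

    fclose(in);
    fclose(out);
    return true;
}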
This might be slightly faster using memory mapping:
open src, create memory mapping over the file
open/create dest, set file size to size of src, create memory mapping over the file
memcpy() src to dest
For large files, smaller mapped views should be used; a sketch of the whole-file variant follows below.
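A hedged sketch of that memory-mapped copy on POSIX (whole-file mapping, reduced error handling; for very large files you would map and copy a window at a time instead):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>

bool mmap_copy(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    if (in < 0) return false;
    struct stat st;
    fstat(in, &st);

    int out = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (out < 0) { close(in); return false; }
    ftruncate(out, st.st_size);             // set dest size = src size

    void *s = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, in, 0);
    void *d = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, out, 0);
    if (s != MAP_FAILED && d != MAP_FAILED)
        memcpy(d, s, st.st_size);           // the actual copy

    if (s != MAP_FAILED) munmap(s, st.st_size);
    if (d != MAP_FAILED) munmap(d, st.st_size);
    close(in);
    close(out);
    return true;
}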
Streams are slow
cp uses the read(2) or mmap(2) syscalls directly.
I'd wager that it's something clever inside either cp or the filesystem. If it's inside cp, it might be that the file you are copying has a lot of zeros in it, and cp detects this and writes a sparse version of your file. The man page for cp says: "By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well." This could mean a few things, but one of them is that cp may make a sparse version of your file, which would require less disk write time.
If it's within your filesystem, it might be deduplication.
As a long-shot third option, it might also be something within your OS or your disk firmware that translates the reads and writes into some specialized instruction that doesn't require as much synchronization as your program requires (lower bus use means less latency).
You're using a relatively small buffer size. Small buffers mean more operations per second, which increases overhead. Disk systems have a small amount of latency before they receive the read/write request and begin processing it; a larger buffer amortizes that cost a little better. A smaller buffer may also mean that the disk is spending more time seeking.
You're not issuing multiple simultaneous requests - you require one read to finish before the next starts. This means that the disk may have dead time where it is doing nothing. Since all writes depend on all reads, and your reads are serial, you're starving the disk system of read requests (doubly so, since writes will take away from reads).
The total of requested read bytes across all read requests should be larger than the bandwidth-delay product of the disk system. If the disk has 0.5 ms delay and a 4 GB/sec performance, then you want to have 4 GB * 0.5 ms = 2 MB worth of reads outstanding at all times.
You're not using any of the operating system's hints that you're doing sequential reading.
To fix this:
Change your code to have more than one outstanding read request at all times.
Have enough read requests outstanding such that you're waiting on at least 2 MBs worth of data.
Use the posix_fadvise() flags to help the OS disk scheduler and page cache optimize (see the sketch after this list).
Consider using mmap to cut down on overhead.
Use a larger buffer size per read request to cut down on overhead.
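A minimal sketch of the posix_fadvise() hint, assuming a POSIX system that provides it and an already-open file descriptor fd:

#include <fcntl.h>

// Tell the kernel we will read this file sequentially so it can
// schedule larger readahead; offset 0 and length 0 mean "whole file".
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);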
This answer has more information:
https://stackoverflow.com/a/3756466/344638
The problem is that you specify too small a buffer for your fstream:
char buffer[32 * 1000];
std::ofstream of("test.file");
of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
Your app runs in user mode. To write to disk, ofstream calls the system write function, which executes in kernel mode. write then transfers the data to the system cache, then to the HDD cache, and then it is written to the disk.
This buffer size affects the number of system calls (one call for every 32*1000 bytes). During a system call the OS must switch execution context from user mode to kernel mode and then back. Switching context is overhead: on Linux it is equivalent to about 2500-3500 simple CPU instructions. Because of that, your app is spending most of its CPU time in context switches.
In your second app you use
FILE* file = fopen("test.file", "w");
FILE uses a bigger buffer by default, which is why it produces more efficient code. You can try to specify a small buffer with setvbuf; in that case you should see the same performance degradation.
Please note that in your case the bottleneck is not HDD performance; it is context switching.
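For reference, a minimal sketch of that setvbuf experiment (the buffer size matches the stream test above; this is an assumption, not code from the question):

#include <cstdio>

FILE *file = fopen("test.file", "w");
static char small_buf[32 * 1000];
// Force the same small buffer the ofstream test used; this should
// reproduce the same syscall-bound slowdown.
setvbuf(file, small_buf, _IOFBF, sizeof small_buf);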

Parse very large CSV files with C++

My goal is to parse large CSV files with C++ in a Qt project in an OSX environment.
(When I say CSV I also mean TSV and other variants, 1 GB ~ 5 GB.)
It seems like a simple task, but things get complicated when the file sizes get bigger. I don't want to write my own parser because of the many edge cases related to parsing CSV files.
I have found various CSV processing libraries to handle this job, but parsing a 1 GB file takes about 90 ~ 120 seconds on my machine, which is not acceptable. I am not doing anything with the data right now; I just process and discard it for testing purposes.
cccsvparser is one of the libraries I have tried, but the only library fast enough was fast-cpp-csv-parser, which gives acceptable results: 15 seconds on my machine. However, it works only when the file structure is known.
Example using: fast-cpp-csv-parser
#include "csv.h"
int main(){
io::CSVReader<3> in("ram.csv");
in.read_header(io::ignore_extra_column, "vendor", "size", "speed");
std::string vendor; int size; double speed;
while(in.read_row(vendor, size, speed)){
// do stuff with the data
}
}
As you can see, I cannot load arbitrary files; I must specifically define variables to match my file structure, and I'm not aware of any method that allows me to create those variables dynamically at runtime.
The other approach I have tried is to read the CSV file line by line with the fast-cpp-csv-parser LineReader class, which is really fast (about 7 seconds to read the whole file), and then parse each line with the cccsvparser library, which can process strings. This takes about 40 seconds, an improvement compared to the first attempts but still unacceptable.
I have seen various Stack Overflow questions related to CSV file parsing, but none of them takes large-file processing into account.
Also, I spent a lot of time googling for a solution, and I really miss the freedom that package managers like npm or pip offer when searching for out-of-the-box solutions.
I will appreciate any suggestion about how to handle this problem.
Edit:
When using #fbucek's approach, processing time was reduced to 25 seconds, which is a great improvement.
Can we optimize this even more?
I am assuming you are using only one thread. Multithreading can speed up your process.
The best result so far is 40 sec, so let's stick to that.
I have assumed that first you read, then you process (about 7 secs to read the whole file):
7 sec for reading
33 sec for processing
First of all, you can divide your file into chunks, let's say 50 MB each.
That means you can start processing after reading the first 50 MB of the file; you do not need to wait until the whole file is read.
That's 0.35 sec for reading (now it is 0.35 + 33 sec for processing = circa 34 sec).
When you use multithreading, you can process multiple chunks at a time, which can theoretically speed up the processing by up to the number of your cores. Let's say you have 4 cores.
That's 33/4 = 8.25 sec.
So in total, with 4 cores, I think you can speed up your processing to about 9 s.
Look at QThreadPool with QRunnable, or QtConcurrent. I would prefer QThreadPool.
Divide the task into parts:
First, try to loop over the file and divide it into chunks, doing nothing else with them.
Then create a "ChunkProcessor" class which can process one chunk.
Make "ChunkProcessor" a subclass of QRunnable and execute your processing in the reimplemented run() function.
Once you have chunks, a class that can process them, and that class is QThreadPool-compatible, you can pass it into the pool.
It could look like this:

loop over file {
    whenever a chunk is ready {
        ChunkProcessor *chunkprocessor = new ChunkProcessor(chunk);
        QThreadPool::globalInstance()->start(chunkprocessor);
        connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)),
                this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
    }
}
You can use std::shared_ptr to pass the processed data, in order not to need QMutex or anything else, and to avoid serialization problems with multiple threads accessing the same resource.
Note: in order to use a custom type in a signal you have to register it before use:
qRegisterMetaType<std::shared_ptr<ProcessedData>>("std::shared_ptr<ProcessedData>");
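A minimal sketch of what such a ChunkProcessor could look like (ProcessedData and the parsing body are assumptions for illustration):

#include <QObject>
#include <QRunnable>
#include <QByteArray>
#include <memory>

// Processes one chunk on a pool thread and reports the result via a signal.
class ChunkProcessor : public QObject, public QRunnable
{
    Q_OBJECT
public:
    explicit ChunkProcessor(QByteArray *chunk) : m_chunk(chunk)
    {
        setAutoDelete(true);   // QThreadPool deletes this object after run()
    }
    void run() override
    {
        auto result = std::make_shared<ProcessedData>();
        // ... parse the lines in *m_chunk into *result ...
        delete m_chunk;
        emit finished(result);
    }
signals:
    void finished(std::shared_ptr<ProcessedData> result);
private:
    QByteArray *m_chunk;
};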
Edit (based on discussion; my answer was not clear about this):
It does not matter what disk you use or how fast it is: reading is a single-threaded operation.
This solution was suggested only because it took 7 sec to read, and again, it does not matter what disk it is; 7 sec is what counts. The only purpose is to start processing as soon as possible instead of waiting until reading is finished.
You can use:
QByteArray data = file.readAll();
Or you can use the principal idea (I do not know why it takes 7 sec to read, or what is behind it):
QFile file("in.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
    return;

QByteArray* data = new QByteArray;
int count = 0;
while (!file.atEnd()) {
    ++count;
    data->append(file.readLine());
    if (count > 10000) {
        ChunkProcessor *chunkprocessor = new ChunkProcessor(data);
        QThreadPool::globalInstance()->start(chunkprocessor);
        connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)),
                this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
        data = new QByteArray;
        count = 0;
    }
}
One file, one thread, read almost as fast as reading line by line "without" interruption.
What you do with the data is another problem, but it has nothing to do with I/O; it is already in memory. So the only concern would be a 5 GB file and the amount of RAM on the machine.
It is a very simple solution: all you need is to subclass QRunnable, reimplement the run function, emit a signal when it is finished, pass the processed data using a shared pointer, and in the main thread join that data into one structure or whatever. A simple thread-safe solution.
I would propose a multi-thread suggestion with a slight variation: one thread is dedicated to reading the file in chunks of a predefined (configurable) size and keeps feeding data to a set of processing threads (more than one, based on the CPU core count). Let us say the configuration looks like this:
chunk size = 50 MB
disk thread = 1
process threads = 5
Create a class for reading data from the file. It holds a data structure that is used to communicate with the process threads; for example, this structure contains the starting and ending offset of the read buffer for each process thread (a sketch of such a structure follows below). For reading the file data, the reader class holds two buffers, each of chunk size (50 MB in this case).
Create a process class which holds (shared) pointers to the read buffers and the offsets data structure.
Now create a driver (probably the main thread) which creates all the threads, waits for their completion, and handles the signals.
The reader thread is invoked with the reader class, reads 50 MB of data, and creates an offsets-structure object based on the number of threads. In this case t1 handles 0 - 10 MB, t2 handles 10 - 20 MB, and so on. Once ready, it notifies the processor threads. It then immediately reads the next chunk from disk and waits for the completion notifications from the processor threads.
The processor threads, on notification, read data from the buffer and process it. Once done, each notifies the reader thread about completion and waits for the next chunk.
This process repeats until all the data is read and processed. Then the reader thread notifies the main thread about completion, which sends PROCESS_COMPLETION, upon which all threads exit; or the main thread chooses to process the next file in the queue.
Note that offsets are used here for easy explanation; mapping offsets to line delimiters needs to be handled programmatically.
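A minimal sketch of the offsets structure described above (the names are illustrative assumptions):

#include <cstddef>

// One entry per processor thread: which part of the current read
// buffer that thread owns. Filled in by the reader thread.
struct ChunkSlice {
    const char *buffer;   // one of the reader's two 50 MB buffers
    std::size_t begin;    // first byte this thread owns
    std::size_t end;      // one past the last byte, aligned to a '\n'
};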
If the parser you have used is not distributed, the approach is obviously not scalable.
I would vote for a technique like the one below:
chunk up the file into a size that can be handled by one machine / your time constraint
distribute the chunks to a cluster of machines (1..*) that can meet your time/space requirements
consider dealing in block sizes for a given chunk
avoid threads on the same resource (i.e. a given block) to save yourself from all thread-related issues
use threads to achieve non-competing (per-resource) operations, such as reading on one thread and writing on a different thread to a different file
do your parsing (now you can invoke your parser for this small chunk)
do your operations
merge the results back, or distribute them as they are
Now, having said that, why can't you use Hadoop-like frameworks?

Irregular file writing performance in c++

I am writing an app which receives a binary data stream with a simple function call like put(DataBLock, dateTime); where each data package is 4 MB.
I have to write these data blocks to separate files for future use, with some additional data like id, insertion time, tag, etc.
So I tried both of these two methods:
first with FILE:
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
FILE* pFile;
pFile = fopen(fNameArray,"wb");
fwrite(reinterpret_cast<const char *>(&data.dataTime), 1, sizeof(data.dataTime), pFile);
data.dataInsertionTime = time(0);
fwrite(reinterpret_cast<const char *>(&data.dataInsertionTime), 1, sizeof(data.dataInsertionTime), pFile);
fwrite(reinterpret_cast<const char *>(&data.id), 1, sizeof(long), pFile);
fwrite(reinterpret_cast<const char *>(&data.tag), 1, sizeof(data.tag), pFile);
fwrite(reinterpret_cast<const char *>(&data.data_block[0]), 1, data.data_block.size() * sizeof(int), pFile);
fclose(pFile);
second with ostream:
ofstream fout;
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
fout.open(fNameArray, ios::out| ios::binary | ios::app);
fout.write(reinterpret_cast<const char *>(&data.dataTime), sizeof(data.dataTime));
data.dataInsertionTime = time(0);
fout.write(reinterpret_cast<const char *>(&data.dataInsertionTime), sizeof(data.dataInsertionTime));
fout.write(reinterpret_cast<const char *>(&data.id), sizeof(long));
fout.write(reinterpret_cast<const char *>(&data.tag), sizeof(data.tag));
fout.write(reinterpret_cast<const char *>(&data.data_block[0]), data.data_block.size() * sizeof(int));
fout.close();
In my tests the first method looks faster, but my main problem is that in both cases everything goes fine at first; every file write takes almost the same time (about 20 milliseconds), but after the 250th - 300th package it starts to show peaks, from 150 to 300 milliseconds, then drops back to 20 milliseconds, then peaks again at 150 ms, and so on. It becomes very unpredictable.
When I put some timers into the code, I figured out that the main reason for these peaks is the fout.open(...) and pFile = fopen(...) lines. I have no idea whether this is because of the operating system, the hard drive, some kind of cache or buffer mechanism, etc.
So the question is: why do these file-opening lines become problematic after some time, and is there a way to make the file write operation stable, i.e. fixed-time?
Thanks.
NOTE: I'm using Visual Studio 2008 VC++, Windows 7 x64. (I tried the 32-bit configuration too, but the result is the same.)
EDIT: After some point the writing speed slows down as well, even when the file-open time is at its minimum. I tried different package sizes; here are the results:
For 2 MB packages it takes double the time to slow down; the slowdown begins after about the 600th item.
For 4 MB packages, almost the 300th item.
For 8 MB packages, almost the 150th item.
So it seems to me to be some sort of caching problem (in the hard drive or the OS)? But I also tried disabling the hard drive cache, and nothing changed.
Any idea?
This is all perfectly normal; you are observing the behavior of the file system cache, a chunk of RAM that is set aside by the operating system to buffer disk data. It is normally a fat gigabyte, and can be much more if your machine has lots of RAM. It sounds like you've got 4 GB installed, not that much for a 64-bit operating system; it depends, however, on the RAM needs of the other processes running on the machine.
Your calls to fwrite() or ofstream::write() write to a small buffer created by the CRT, which in turn makes operating system calls to flush full buffers. The OS writes normally complete very quickly: they are simple memory-to-memory copies from the CRT buffer to the file system cache. Effective write speed is in excess of a gigabyte per second.
The file system driver lazily writes the file system cache data to the disk, optimized to minimize seek time on the write head, by far the most expensive operation on a disk drive. Effective write speed is determined by the rotational speed of the disk platter and the time needed to position the write head; typical is around 30 megabytes/second for consumer-level drives, give or take a factor of 2.
Perhaps you see the fire-hose problem here: you are writing to the file cache a lot faster than it can be emptied. This hits the wall eventually; you'll manage to fill the cache to capacity and suddenly see the performance of your program fall off a cliff. Your program must now wait until space opens up in the cache so the write can complete; the effective write speed is now throttled by disk write speed.
The 20 msec delays you observe are normal as well; that is typically how long it takes to open a file. That time is completely dominated by disk head seek time: the head needs to travel to the file system index to write the directory entry. Nominal times are between 20 and 50 msec; you are on the low end of that already.
Clearly there is very little you can do in your code to improve this. Which CRT functions you use certainly doesn't make any difference, as you found out. At best you could increase the size of the files you write; that reduces the overhead spent on creating files.
Buying more RAM is always a good idea, but of course it merely delays the moment when the fire hose overflows the bucket. You need better drive hardware to get ahead: an SSD is pretty nice, so is a striped RAID array. The best thing to do is simply not to wait for your program to complete :)
So the question is: why do these file opening lines become problematic after some time, and is there a way to make the file write operation stable, I mean fixed-time?
This observation (i.e. the varying time taken by the write operation) does not mean that there is a problem in the OS or the file system. There could be various reasons behind your observation. One possible reason is delayed write: the kernel may only cache (buffer) the data rather than writing it to disk immediately, in case another process should read or write it soon, so that an extra disk operation can be avoided.
This situation may lead to inconsistency in the time taken by different write calls for the same size of data/buffer.
File I/O is a bit of a complex and complicated topic and depends on various other factors. For complete information on the internal algorithms of a file system, you may want to refer to the classic book "The Design of the UNIX Operating System" by Maurice J. Bach, which describes these concepts and their implementation in detail.
Having said that, you may want to use the flush call immediately after your write call in both versions of your program (i.e. C and C++). This way you may get consistent file I/O write times; otherwise your programs' behaviour looks correct to me.
// C program
fwrite(data, 1, size, fp);
fflush(fp);

// C++ program
fout.write(data, size);
fout.flush();
It's possible that the spikes are not related to I/O itself but to NTFS metadata: when your file count reaches some limit, some NTFS AVL-like data structure needs refactoring and... bump!
To check it, you should preallocate the file entries, for example by creating all the files with zero size and then opening them for writing, just for testing: if my theory is correct you shouldn't see your spikes anymore.
UHH - and you must disable file indexing (the Windows Search service) there! Just remembered it... see here.

Linux & Windows: Using Large Files to save physical memory

Good afternoon. We have implemented a C++ cKeyArray class to test whether we can use the Large File API to save physical memory. During CentOS Linux testing, we found that the Linux file API was just as fast as using the heap for random access processing. Here are the numbers, for a 2,700,000-row SQL database where the KeySize for each row is 62 bytes:
cKeyArray class using the Linux file API:
BruteForceComparisons = 197275, BruteForceTimeElapsed = 1,763,504,445 microsecs
Each brute-force comparison requires two random accesses, so the mean time required for each random access = 1,763,504,445 microsecs / (2 * 197275) = 4470 microsecs.
Heap, no cKeyArray class:
BruteForceComparisons = 197275, BruteForceTimeElapsed = 1,708,442,690 microsecs
The mean time required for each random access = 4330 microsecs.
On 32-bit Windows, the numbers are:
cKeyArray class using the Windows file API:
BruteForceComparisons = 197275, BruteForceTimeElapsed = 9,243,787 millisecs
The mean time for each random access is 23.4 millisecs.
Heap, no cKeyArray class:
BruteForceComparisons = 197275, BruteForceTimeElapsed = 2,141,941 millisecs
The mean time required for each random access is 5.4 millisecs.
We are wondering why the Linux cKeyArray numbers are just as good as the Linux heap numbers, while on 32-bit Windows the mean heap random access time is four times as fast as the cKeyArray Windows file API. Is there some way we can speed up the Windows cKeyArray file API?
Previously, we received a lot of good suggestions from Stack Overflow on using the Windows memory-mapped file API. Based on those Stack Overflow suggestions we have implemented a memory-mapped-file MRU caching class which functions properly.
Because we want to develop a cross-platform solution, we want to do due diligence to see why the Linux file API is so fast. Thank you. We are posting a portion of the cKeyArray class implementation below.
#define KEYARRAY_THRESHOLD 100000000
// Use a file instead of memory if the requirement is above this number

cKeyArray::cKeyArray(long RecCount_, int KeySize_, int MatchCodeSize_, char* TmpFileName_)
{
    RecCount = RecCount_;
    KeySize = KeySize_;
    MatchCodeSize = MatchCodeSize_;
    MemBuffer = 0;
    KeyBuffer = 0;
    MemFile = 0;
    MemFileName[0] = '\x0';
    ReturnBuffer = new char[MatchCodeSize + 1];

    if (RecCount * KeySize <= KEYARRAY_THRESHOLD) {
        InMemory = true;
        MemBuffer = new char[RecCount * KeySize];
        memset(MemBuffer, 0, RecCount * KeySize);
    } else {
        InMemory = false;
        strcpy(MemFileName, TmpFileName_);
        try {
            MemFile =
                new cFile(MemFileName, cFile::CreateAlways, cFile::ReadWrite);
        }
        catch (cException e)
        {
            throw e;
        }
        try {
            MemFile->SetFilePointer(
                (int64_t)(RecCount * KeySize), cFile::FileBegin);
        }
        catch (cException e)
        {
            throw e;
        }
        if (!(MemFile->SetEndOfFile()))
            throw cException(ERR_FILEOPEN, MemFileName);
        KeyBuffer = new char[KeySize];
    }
}
char *cKeyArray::GetKey(long Record_)
{
    memset(ReturnBuffer, 0, MatchCodeSize + 1);
    if (InMemory) {
        memcpy(ReturnBuffer, MemBuffer + Record_ * KeySize, MatchCodeSize);
    } else {
        MemFile->SetFilePointer((int64_t)(Record_ * KeySize), cFile::FileBegin);
        MemFile->ReadFile(KeyBuffer, KeySize);
        memcpy(ReturnBuffer, KeyBuffer, MatchCodeSize);
    }
    return ReturnBuffer;
}

uint32_t cKeyArray::GetDupeGroup(long Record_)
{
    uint32_t DupeGroup(0);
    if (InMemory) {
        memcpy((char*)&DupeGroup,
               MemBuffer + Record_ * KeySize + MatchCodeSize, sizeof(uint32_t));
    } else {
        MemFile->SetFilePointer(
            (int64_t)(Record_ * KeySize + MatchCodeSize), cFile::FileBegin);
        MemFile->ReadFile((char*)&DupeGroup, sizeof(uint32_t));
    }
    return DupeGroup;
}
On Linux, the OS aggressively caches file data in main memory, so although you haven't explicitly allocated memory for the file contents, they are nevertheless stored in RAM. Here's a decent link with some more information about the page cache; only one thing is missing from that description, which is that most Linux filesystems actually implement the standard I/O interfaces as thin wrappers around the page cache. That means that even though you haven't explicitly memory-mapped the file, the system is still treating it as though it were memory-mapped under the covers. That's why you see roughly equivalent performance with either approach.
I second the suggestion to factor the platform-specific stuff out and use whichever approach is fastest for each platform. Be sure to benchmark; don't ever make assumptions about performance.
Your memory-mapped solution should be as much as 10x faster than the file solution, even on Linux; that is the speedup I see in my test cases.
Each file-access system call takes hundreds of CPU cycles to complete, time which your program could be using to do real work.
One explanation for why the speeds are similar could be that your memory map has not been touched before. When a memory-mapped page is accessed for the first time it must be assigned to a physical page of RAM and zeroed out, or, if it is backed by a disk file, loaded from disk into RAM. All of that takes a considerable amount of time.
If you touch (read or write a value in) each 4K page of the mapping before using it, you should see a significant speed increase in the memory map.
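A minimal sketch of such a pre-touch pass (base and size stand for the mapped region and are assumptions for illustration):

#include <cstddef>

// Fault in every 4 KiB page once, so later random accesses hit RAM
// instead of paying a page fault on first touch.
void prefault(const volatile char *base, std::size_t size)
{
    char sink = 0;
    for (std::size_t off = 0; off < size; off += 4096)
        sink += base[off];          // read one byte per page
    (void)sink;                     // the volatile reads cannot be optimized out
}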