win32 I/O performance issues - c++

I have a problem with win32 I/O performance:
I'm trying to achieve a decent write speed using CreateFile/WriteFile.
Using Resource Monitor (it comes with Windows), I measured the write speed of the following piece of code and found that it writes at 2 MB/sec:
HANDLE hFile = INVALID_HANDLE_VALUE;
hFile = CreateFile(
L"test",
(GENERIC_READ | GENERIC_WRITE),
FILE_SHARE_READ,
NULL,
OPEN_ALWAYS,
(FILE_ATTRIBUTE_NORMAL |
FILE_FLAG_WRITE_THROUGH |
FILE_FLAG_NO_BUFFERING),
NULL);
if (hFile != INVALID_HANDLE_VALUE)
{
//OK
unsigned long bytesWritten = 0;
unsigned long* Buffer = (unsigned long*)malloc(4096*sizeof(unsigned long));
ZeroMemory(Buffer, 4096); //thanks to 'bash.d'
while (true)
{
/*the infinite loop is intentional
because I wanted to see if the writing speed of 2MB/sec
was right */
WriteFile(hFile,
Buffer,
4096,
&bytesWritten,
NULL);
if (bytesWritten <= 0)
{
break;
}
}
}
I tried with the following and it's the same...
hFile = CreateFile(
L"test",
(GENERIC_READ | GENERIC_WRITE),
FILE_SHARE_READ,
NULL,
OPEN_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL);
What am I doing wrong (regarding the write speed), and how can I improve it?
Thank you, and sorry for my English.
Edit:
I'm writing on a local disk

This is very interesting, and similar to an issue I have, and can reproduce on 2 different servers with Windows Server 2003 SP2 64-bit (single hard drives, not RAID). Simply doing a WriteFile() of 36 bytes and then 99964 bytes in a loop produces similar behavior (I'm guessing it would be the same with a single write, and some other versions of Windows; that's just what I happened to be using). CPU usage starts off very low, and then increases gradually -- on one server, the test was around 50% CPU usage at around 175GB (about 95% of that is kernel time; 60% in my program and 40% in 'System').
You may also try async IO to get the best performance. That means opening the file with FILE_FLAG_OVERLAPPED and using the LPOVERLAPPED argument of WriteFile. You may or may not get better performance with FILE_FLAG_NO_BUFFERING. You will have to test to see.
FILE_FLAG_NO_BUFFERING will generally give you more consistent speeds and better streaming behavior, and it avoids polluting your disk cache with data that you may not need again, but it isn't necessarily faster overall.
You should also test to see what the best size is for each block of IO. In my experience there is a huge performance difference between copying a file 4 KB at a time and copying it 1 MB at a time.
In my past testing of this (a few years ago) I found that block sizes below about 64kB were dominated by overhead, and total throughput continued to improve with larger block sizes up to about 512KB. I wouldn't be surprised if with today's drives you needed to use block sizes larger than 1MB to get maximum throughput.
The numbers you are currently using appear to be reasonable, but may not be optimal. Also I'm fairly certain that FILE_FLAG_WRITE_THROUGH prevents the use of the on-disk cache and thus will cost you a fair bit of performance.
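A quick way to try different block sizes is a small probe like the one below. This is only a sketch: it uses portable stdio rather than WriteFile, it goes through the OS file cache (so it shows per-call overhead, not raw disk speed), and the name probe_block_size is made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Result of one probe run: bytes actually written and elapsed wall-clock time.
struct ProbeResult {
    std::size_t bytes_written;
    double seconds;
};

// Writes `total_bytes` to `path` in chunks of `block_bytes` and times the loop.
// Hypothetical helper for illustration; writes go through the OS file cache.
ProbeResult probe_block_size(const char* path, std::size_t total_bytes,
                             std::size_t block_bytes) {
    assert(block_bytes > 0);
    std::vector<char> block(block_bytes, 0);
    std::FILE* fp = std::fopen(path, "wb");
    if (!fp) return {0, 0.0};
    // Disable CRT buffering so each fwrite is handed to the OS directly.
    std::setvbuf(fp, nullptr, _IONBF, 0);

    std::size_t written = 0;
    auto start = std::chrono::steady_clock::now();
    while (written < total_bytes) {
        std::size_t want = total_bytes - written;
        if (want > block_bytes) want = block_bytes;
        std::size_t got = std::fwrite(block.data(), 1, want, fp);
        written += got;
        if (got != want) break;  // stop on a short write
    }
    std::fclose(fp);
    auto stop = std::chrono::steady_clock::now();
    return {written, std::chrono::duration<double>(stop - start).count()};
}
```

Calling it in a loop over, say, 4 KB, 64 KB, 512 KB and 1 MB blocks with the same total size and comparing bytes_written / seconds gives a first impression of where the per-call overhead stops mattering on a given machine.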
It would be worth trying the following:
1) Enabling FILE_FLAG_SEQUENTIAL_SCAN flag
2) "Enable advanced performance" in "Disk Policies" in the Device Manager
3) Varying disk chunk size from 64 KB to 4096 ...
4) Try FILE_FLAG_NO_BUFFERING

Use async IO bound to a completion port
Pre-grow the file using SetFileValidData
Open the handle with FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH
A consumer grade drive (even 5400RPM) should be able to write ~130MB/sec (single spindle, no raid). No other IO should occur at the same time (no head movement).
See https://github.com/rusanu/writing-a-binary-file-in-c-very-fast for an example.

Related

WriteFile fails for > 4700 blocks (SD card raw write / Window)

I am writing / reading raw data on a SD card. The code for writing is working up to approx. 4700 blocks and fails after this limit. Here is the code:
//Data to be written
uint8_t* sessions;
sessions = (uint8_t *) malloc(2048*sizeof(uint8_t));
unsigned int i;
for(i=0;i<(2048*sizeof(uint8_t));i++) sessions[i]=8;
DWORD dwWrite;
HANDLE hDisk=CreateFileA("\\\\.\\K:", // drive to open = SD CARD
GENERIC_WRITE, // access to the drive
FILE_SHARE_READ | // share mode
FILE_SHARE_WRITE,
NULL, // default security attributes
OPEN_EXISTING, // disposition
FILE_FLAG_NO_BUFFERING, // file attributes
NULL); // do not copy file attributes
if(hDisk==INVALID_HANDLE_VALUE)
{
CloseHandle(hDisk);
printf("ERROR opening the file !!! ");
}
DWORD dwPtr = SetFilePointer(hDisk,10000*512,0,FILE_BEGIN); //4700 OK
if (dwPtr == INVALID_SET_FILE_POINTER) // Test for failure
{
printf("CANNOT move the file pointer !!! ");
}
//Try using this structure but same results: CAN BE IGNORED
OVERLAPPED osWrite = {0,0,0};
memset(&osWrite, 0, sizeof(osWrite));
osWrite.Offset = 10000*512; //4700 OK
osWrite.hEvent = CreateEvent(FALSE, FALSE, FALSE, FALSE);
if( FALSE == WriteFile(hDisk,sessions,2048,&dwWrite,&osWrite) ){
printf("CANNOT write data to the SD card!!! %lu",dwWrite);
}else{
printf("Written %lu on SD card",dwWrite);
}
CloseHandle(hDisk);
The issue is with the WriteFile function (windows.h). If the block number is below roughly 4700, everything is fine (the data is written to the SD card), but if the block number is, say, 5000 or 10000, the function fails ("Written 0").
Note that without FILE_FLAG_NO_BUFFERING there is no way to open the drive (SD card). The OVERLAPPED structure is a failed attempt to make it work; not using it (WriteFile(hDisk,sessions,2048,&dwWrite,NULL)) leads to the same behaviour. SetFilePointer also works for blocks higher than 4700. I have also tested 2 different SD cards. I am on Windows 10.
Any hint as to what is happening?
Thank you for your input
From the documentation for WriteFile:
A write on a volume handle will succeed if the volume does not have a mounted file system, or if one of the following conditions is true:
The sectors to be written to are boot sectors.
The sectors to be written to reside outside of file system space.
You have explicitly locked or dismounted the volume by using FSCTL_LOCK_VOLUME or FSCTL_DISMOUNT_VOLUME.
The volume has no actual file system. (In other words, it has a RAW file system mounted.)
You are able to write to the first couple of megabytes because (for historical reasons) the file system doesn't use that space. In order to write to the rest of the volume, you'll first have to lock the volume using the FSCTL_LOCK_VOLUME control code.
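The block numbers from the question line up with that explanation. A small sketch of the arithmetic, assuming the 512-byte block size used in the question's SetFilePointer call (the helper names are made up, and the exact boundary depends on how the card is laid out):

```cpp
#include <cassert>
#include <cstdint>

// Byte offset of a block, computed in 64 bits. The 512-byte default is the
// block size used in the question's SetFilePointer call.
std::uint64_t block_to_byte_offset(std::uint64_t block,
                                   std::uint64_t block_size = 512) {
    assert(block_size > 0);
    return block * block_size;
}

// The same offset expressed in MiB, for comparison with
// "the first couple of megabytes".
double offset_in_mib(std::uint64_t block, std::uint64_t block_size = 512) {
    return static_cast<double>(block_to_byte_offset(block, block_size)) /
           (1024.0 * 1024.0);
}
```

Block 4700 starts at 2,406,400 bytes (about 2.3 MiB), while block 10000 starts at 5,120,000 bytes (about 4.9 MiB), so the writes that succeed are the ones inside the first couple of megabytes.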
You should pass NULL as the 3rd parameter of SetFilePointer, lpDistanceToMoveHigh, unless you are using the high-order 32 bits of a 64-bit offset. Also, if you are not using the OVERLAPPED structure, make sure to pass NULL to WriteFile for that parameter.
Also, be sure that you are not having any overflows for the data types you are using. And, be mindful of the addressing limitations of the system you are working on.
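To make the overflow point concrete, here is a sketch of how a 64-bit offset splits into the two 32-bit parts that SetFilePointer takes (lDistanceToMove and lpDistanceToMoveHigh). The struct and function names are hypothetical; only the splitting logic is the point.

```cpp
#include <cassert>
#include <cstdint>

// Low and high 32-bit halves of a 64-bit file offset, in the form that
// SetFilePointer's lDistanceToMove / lpDistanceToMoveHigh parameters take.
struct OffsetParts {
    std::int32_t low;
    std::int32_t high;
};

OffsetParts split_offset(std::uint64_t offset) {
    assert(offset <= static_cast<std::uint64_t>(INT64_MAX));
    OffsetParts parts;
    // Keep the bit pattern of each half; the low half is reinterpreted as signed.
    parts.low = static_cast<std::int32_t>(
        static_cast<std::uint32_t>(offset & 0xFFFFFFFFu));
    parts.high = static_cast<std::int32_t>(
        static_cast<std::uint32_t>(offset >> 32));
    return parts;
}

// True when the offset fits in a signed 32-bit value, i.e. when NULL can be
// passed for lpDistanceToMoveHigh.
bool fits_without_high_part(std::uint64_t offset) {
    return offset <= 0x7FFFFFFFu;
}
```

For the question's offset, 10000*512 = 5,120,000, fits_without_high_part returns true, so overflow is not what makes that particular write fail. It would start to matter for offsets of 2 GiB and above, where the multiplication should also be done in a 64-bit type.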
MSDN WriteFile
MSDN SetFilePointer

Windows shared memory access time slow

I am currently using shared memory with two mapped files (1.9 GBytes for the first one and 600 MBytes for the second) in a piece of software.
A process reads data from the first file, processes it, and writes the results to the second file.
I have noticed that reading from or writing to the mapped view with memcpy is sometimes severely delayed, for reasons I don't understand.
Mapped files are created this way :
m_hFile = ::CreateFileW(SensorFileName,
GENERIC_READ | GENERIC_WRITE,
0,
NULL,
CREATE_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL);
m_hMappedFile = CreateFileMapping(m_hFile,
NULL,
PAGE_READWRITE,
dwFileMapSizeHigh,
dwFileMapSizeLow,
NULL);
And memory mapping is done this way :
m_lpMapView = MapViewOfFile(m_hMappedFile,
FILE_MAP_ALL_ACCESS,
dwOffsetHigh,
dwOffsetLow,
m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow are "matching" granularity from the system info.
The process is reading about 300KB * N times, storing that in a buffer, processing and then writing 300KB * N times the processed contents of the previous buffer to the second file.
I have two different memory views (created/moved with MapViewOfFile function) with a size of 10 MBytes as default size.
For memory view size, I tested 10kBytes, 100kB, 1MB, 10MB and 100MB. Statistically no difference, 80% of the time reading process is as described below (~200ms) but writing process is really slow.
Normally :
1/ Reading is done in ~200ms.
2/ Process done in 2.9 seconds.
3/ Writing is done in ~200ms.
I can see that 80% of the time, either reading or writing (in the worst case both are slow) will take between 2 and 10 seconds.
Example : For writing, I am using the below code
for (unsigned int i = 0 ; i < N ; i++) // N = 500~3k
{
// Check the position of the memory view for ponderation
if (###)
MoveView(iOffset);
if (m_lpMapView)
{
memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
// uiSize = ~300 kBytes
memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
}
else
return uiANNStatus;
}
Using GetTickCount to pinpoint the delay, I see that the second memcpy call is always the one taking most of the time.
So far, N calls to memcpy (I used N = 500 for the test) take 10 seconds in the worst case when using those shared memories.
I wrote a temporary program that made the same number of memcpy calls with the same amount of data and couldn't reproduce the problem.
For tests, I used the following conditions, they all show the same delay :
1/ I can see this on various computers, 32 or 64 bits from windows 7 to windows 10.
2/ Using the main thread or multi-threads (up to 8 with critical sections for synchronization purpose) for reading/writing.
3/ OS on SATA or SSD, memory mapped files of the software physically on a SATA or SSD hard-disk, and if on external hard-disk, tests were done through USB1, USB2 or USB3.
I am kindly asking you what you would think my mistake is for memcpy to go slow.
Best regards.
I found a solution that works for me, but it might not work for others.
Following Thomas Matthews' comments, I checked MSDN and found two interesting functions, FlushViewOfFile and FlushFileBuffers (but couldn't find anything interesting about locking memory).
Calling both after the for loop forces the mapped file to be updated.
I no longer have the "random" delay, but instead of the expected 200ms I get an average of 400ms, which is good enough for my application.
After some tests I saw that calling them too often causes heavy hard-disk access and makes the delay worse (10 seconds for every for loop), so the flushes should be used carefully.
Thanks.

HDD benchmark in C++ - measured transfer speed is too fast

I am trying to develop a mini benchmarking system in C++ and I have trouble measuring the HDD read and write speed. More exactly, the transfer speed I measure is huge: 400-600 MB/s for read and above 1000 MB/s for write. I have a 5400 RPM hard disk drive (not SSD); its real read/write speed (according to a benchmarking program) is roughly 60 MB/s.
//blockSize is 4096
//my data buffer
char* mydata = (char*)malloc(1*blockSize);
//initialized with random data
srand(time(NULL));
for(int i=0;i<blockSize;i++){
mydata[i] = rand()%256;
}
double startt, endt, difft;
int times = 10*25000;
int i=0,j=0;
DWORD written;
HANDLE f, g;
DWORD read;
f=CreateFileA(
"newfolder/myfile.txt",
GENERIC_WRITE,
0,
NULL,
CREATE_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL
);
if(f==INVALID_HANDLE_VALUE){
std::cout<<"Error openning for write.";
return -1;
}
startt = clock();
for(i=0;i<times;i++){
WriteFile(
f,
mydata,
blockSize,
&written,
NULL
);
}
endt = clock();
difft = 1.0*(endt-startt)/(1.0*CLOCKS_PER_SEC);
std::cout<<"\nWrite time: "<<difft;
std::cout<<"\nWrite speed: "<<1.0*times*blockSize/difft/1024/1024<<" MB/s";
CloseHandle(f);
//------------------------------------------------------------------------------------------------
g=CreateFile("newfolder/myfile.txt",
GENERIC_READ,
0,
NULL,
OPEN_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL
);
if(g==INVALID_HANDLE_VALUE){
std::cout<<"Error opening for read.";
return -1;
}
startt = clock();
for(i=0;i<times;i++){
ReadFile(
g,
mydata,
blockSize,
&read,
NULL
);
}
endt = clock();
difft = 1.0*(endt-startt)/(1.0*CLOCKS_PER_SEC);
std::cout<<"\nRead time:"<<difft;
std::cout<<"\nRead speed: "<<1.0*times*blockSize/difft/1024/1024<<" MB/s";
CloseHandle(g);
I tried using fopen and fwrite functions too and I got similar results.
I ran my app on another computer. The write speed was about right, the read speed was still huge.
The most interesting thing is that the application actually creates a 1GB file in about 2 seconds which corresponds to a 500 MB/s write speed.
Does anybody have any idea what I am doing wrong?
Technically, you are doing nothing wrong. The problem is, that every OS uses caching for all I/O operations. The HDD itself also caches some data, so it can perform them efficiently.
This question is very platform-specific. You would need to fool caching somehow.
Perhaps, you should look at this library: Bonnie++. You may find it useful. It was written for Unix systems, but source code could reveal some useful techniques.
On Windows, based on this resource, additional flag FILE_FLAG_NO_BUFFERING passed to CreateFile function should be enough to disable buffering for this file.
Quote:
In these situations, caching can be turned off. This is done at the time the file is opened by passing FILE_FLAG_NO_BUFFERING as a value for the dwFlagsAndAttributes parameter of CreateFile. When caching is disabled, all read and write operations directly access the physical disk. However, the file metadata may still be cached. To flush the metadata to disk, use the FlushFileBuffers function.
You are measuring the performance of cache.
Try storing a lot more data than that, once the cache fills the data should be written straight to the disk.
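A rough sanity check shows why the measured numbers cannot be coming from the platters. The sketch below just redoes the question's arithmetic (times = 10*25000, blockSize = 4096, and the roughly 60 MB/s figure from the other benchmark); the function names are made up for illustration.

```cpp
#include <cassert>

// Seconds a transfer would take at a given sustained rate (MB = 1024*1024
// bytes, matching the question's own MB/s calculation).
double expected_seconds(double total_bytes, double mb_per_second) {
    assert(mb_per_second > 0.0);
    return total_bytes / (1024.0 * 1024.0) / mb_per_second;
}

// Bytes written by the question's loop: times * blockSize.
double benchmark_bytes(int times = 10 * 25000, int block_size = 4096) {
    return static_cast<double>(times) * block_size;
}
```

The loop writes 1,024,000,000 bytes. At 60 MB/s that would take a little over 16 seconds, whereas the question reports about 2 seconds, so most of the data must still be sitting in a cache when the timer stops.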
I think I have figured it out.
Unbuffered file writing speed depends on the size of data the WriteFile function is writing. My experiments show that the bigger the data size, the higher the writing speed. For large amounts of data (>1MB) it even outperforms buffered writing, which I was able to measure by writing data larger than 2GB.
To summarize, one can measure the hard drive writing speed accurately by:
Opening the file using CreateFile and setting the FILE_FLAG_NO_BUFFERING flag.
Writing a lot of data at a time, using WriteFile.
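One thing to keep in mind with FILE_FLAG_NO_BUFFERING: the Windows documentation requires transfer sizes and file offsets to be multiples of the volume sector size, and the buffer address to be sector-aligned as well. A minimal sketch of the rounding involved; the helper names are hypothetical, and the sector size is a parameter because it has to be queried for the actual volume (for example with GetDiskFreeSpace) rather than assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds `size` up to the next multiple of `sector`.
std::size_t round_up_to_sector(std::size_t size, std::size_t sector) {
    assert(sector > 0);
    return (size + sector - 1) / sector * sector;
}

// True when `p` is aligned to `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment > 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With an assumed 4096-byte sector, a 5000-byte payload would have to be written as round_up_to_sector(5000, 4096) = 8192 bytes. A buffer from plain malloc is not guaranteed to pass is_aligned(buf, 4096); page-aligned memory from VirtualAlloc is one common way to get a suitable buffer.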

Is there a way not to use cache in c++

This question might be a bit weird but I wonder if there is a way NOT to use cache in c++.
I'm doing some tests, in this test I'm loading 2 GB (512*4 MB matrices) to memory, then do some correlations among them and calculate performance.
When I run the code for the 1st run, the running time is t1+x second, in the 2nd run, total time is t2+x seconds where t1 and t2 are loading time of 2 GB matrices and t1 > t2. (approx. t1=20, t2=5 sec). My assumption is it is because in the 2nd run, cache is used. (I don't know if there can be any other reason that decreases loading time like that.)
My problem is that, since there is no standard loading time, the results are deceptive in some cases. So I want a standard IO time. The only thing that comes to mind is not to use the cache, if there is a way.
Is there a way to standardize my IO time?
I'm using Windows 7 x64 and working on visual studio 2010, my RAM is 32 GB.
TEST RESULTS: I've compared average loading times of 4MB binary file in 5 options.
The options are 1st run with my original code, 2nd run with original code, using FILE_FLAG_NO_BUFFERING, 1st run using cache and 2nd run as Roy Longbottom suggested.
1st run : 39.1 ms
2nd run : 10.4 ms
no_buffer : 127.8 ms
cache_1st run : 27.4 ms
cache_2nd run : 19.2 ms
My original read code is as follows:
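void readNoise(string fpath,Mat& data){
    FILE* fp = fopen(fpath.c_str(),"rb");
    if (!fp){
        perror("fopen");
        return; // nothing to read if the file could not be opened
    }
    float* buffer= new float[size];
    for(int i=0;i<size;++i) {
        // read one row of `size` floats
        fread(buffer,sizeof(float),size,fp);
        for(int j=0;j<size;++j){
            data.at<float>(i,j)=buffer[j];
        }
    }
    fclose(fp);
    delete[] buffer; // allocated with new[], so release with delete[] rather than free()
}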
I noticed a mistake in my code, which was the dynamic allocation. When I change the dynamic allocation to a static allocation, the running time of the readNoise method becomes the same as the cached version from Roy Longbottom.
The difference between the two runs decreased, but the question remains the same: "How can I standardize the running time of both the first and second run?"
Benchmarking, and micro-benchmarking in particular, is quite complex, and there are many ways you may inadvertently gather false performance data. You should look into micro-benchmarking libraries such as google/benchmark and use one of them to perform your tests.
As you can see from your example, external factors such as the file system cache may cause the timing of individual runs to vary greatly.
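The basic idea those libraries build on can be sketched in a few lines: run the same work several times and report a robust statistic such as the median, so that a single cold-cache run does not dominate. This is only an illustration with made-up helper names, not a replacement for a real benchmarking library.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of a set of samples (sorts a copy of the input).
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    std::size_t mid = samples.size() / 2;
    if (samples.size() % 2 == 1) return samples[mid];
    return (samples[mid - 1] + samples[mid]) / 2.0;
}

// Runs `work` `runs` times and returns each run's duration in milliseconds.
template <typename Work>
std::vector<double> time_runs(Work work, std::size_t runs) {
    std::vector<double> ms;
    ms.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}
```

For example, median(time_runs([&]{ readNoise(path, data); }, 9)) would report the typical warm-cache time, while the first element of the returned vector shows the cold run separately.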
Following is the code I use for my Windows drivespeed32 benchmark (free stuff - Google for drivespeed32), followed by results for 2000 MB files via Windows 7 and cached speeds for a smaller file. Code for Linux version also shown.
if (useCache)
{
hFile = CreateFile(testFile, GENERIC_READ,
FILE_SHARE_READ | FILE_SHARE_WRITE,
NULL, OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, NULL);
}
else
{
hFile = CreateFile(testFile, GENERIC_READ,
FILE_SHARE_READ | FILE_SHARE_WRITE,
NULL, OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN
| FILE_FLAG_NO_BUFFERING, NULL);
}
if (hFile == INVALID_HANDLE_VALUE)
{
SetCurrentDirectory(currentDir);
printf (" Cannot open data file for reading\n\n");
fprintf (outfile, " Cannot open data file for reading\n\n");
fclose(outfile);
printf(" Press Enter\n");
g = getchar();
return 0;
}
Intended for smaller files like 8, 16, 32 MB, so times out after 1 set.
2000 MB File          1        2        3        4        5
Writing MB/sec    85.51    85.40    85.64    83.79    83.19
Reading MB/sec    84.34    85.77    85.60    85.88    85.15
Running Time Too Long At 246 Seconds - No More File Sizes
---------------------------------------------------------------------
8 MB Cached File      1        2        3        4        5
Writing MB/sec  1650.43  1432.86  1536.61  1504.16  1481.58
Reading MB/sec  2225.53  2361.99  2271.81  2235.04  2316.13
Linux Version
if (useCache)
{
handle = open(testFile, O_RDONLY);
}
else
{
handle = open(testFile, O_RDONLY | O_DIRECT);
}
if (handle == -1)
{
printf (" Cannot open data file for reading\n\n");
fprintf (outfile, " Cannot open data file for reading\n\n");
fclose(outfile);
printf(" Press Enter\n");
g = getchar();
return 0;
}

Irregular file writing performance in c++

I am writing an app which receives a binary data stream with a simple function call like put(DataBLock, dateTime); where each data package is 4 MB.
I have to write these data blocks to separate files for future use, with some additional data like id, insertion time, tag, etc.
I tried these two methods:
first with FILE:
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
FILE* pFile;
pFile = fopen(fNameArray,"wb");
fwrite(reinterpret_cast<const char *>(&data.dataTime), 1, sizeof(data.dataTime), pFile);
data.dataInsertionTime = time(0);
fwrite(reinterpret_cast<const char *>(&data.dataInsertionTime), 1, sizeof(data.dataInsertionTime), pFile);
fwrite(reinterpret_cast<const char *>(&data.id), 1, sizeof(long), pFile);
fwrite(reinterpret_cast<const char *>(&data.tag), 1, sizeof(data.tag), pFile);
fwrite(reinterpret_cast<const char *>(&data.data_block[0]), 1, data.data_block.size() * sizeof(int), pFile);
fclose(pFile);
second with ostream:
ofstream fout;
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
fout.open(fNameArray, ios::out| ios::binary | ios::app);
fout.write(reinterpret_cast<const char *>(&data.dataTime), sizeof(data.dataTime));
data.dataInsertionTime = time(0);
fout.write(reinterpret_cast<const char *>(&data.dataInsertionTime), sizeof(data.dataInsertionTime));
fout.write(reinterpret_cast<const char *>(&data.id), sizeof(long));
fout.write(reinterpret_cast<const char *>(&data.tag), sizeof(data.tag));
fout.write(reinterpret_cast<const char *>(&data.data_block[0]), data.data_block.size() * sizeof(int));
fout.close();
In my tests the first method looks faster, but my main problem is that with both approaches everything goes fine at first: every file write takes almost the same time (about 20 milliseconds). After the 250th to 300th package, however, it starts to show peaks of 150 to 300 milliseconds, then goes back down to 20 milliseconds, then up to 150 ms again, and so on. So it becomes very unpredictable.
When I added some timers to the code, I found that the main cause of these peaks is the fout.open(...) and pFile = fopen(...) lines. I have no idea whether this is because of the operating system, the hard drive, some kind of cache or buffer mechanism, etc.
So the question is; why these file opening lines become problematic after some time, and is there a way to make file writing operation stable, I mean fixed time?
Thanks.
NOTE: I'm using Visual studio 2008 vc++, Windows 7 x64. (I tried also for 32 bit configuration but the result is same)
EDIT: After some point writing speed slows down as well even if the opening file time is minimum. I tried with different package sizes so here are the results:
For 2 MB packages it takes double time to slow down, I mean after ~ 600th item slowing down begins
For 4 MB packages almost 300th item
For 8 MB packages almost 150th item
So it seems to me it is some sort of caching problem or something? (in hard drive or OS). But I also tried with disabling hard drive cache but nothing changed...
Any idea?
This is all perfectly normal; you are observing the behavior of the file system cache, which is a chunk of RAM that is set aside by the operating system to buffer disk data. It is normally a fat gigabyte, and can be much more if your machine has lots of RAM. Sounds like you've got 4 GB installed, not that much for a 64-bit operating system. It does however depend on the RAM needs of other processes that run on the machine.
Your calls to fwrite() or ofstream::write() write to a small buffer created by the CRT, which in turn makes operating system calls to flush full buffers. The OS write normally completes very quickly; it is a simple memory-to-memory copy going from the CRT buffer to the file system cache. Effective write speed is in excess of a gigabyte/second.
The file system driver lazily writes the file system cache data to the disk. Optimized to minimize the seek time on the write head, by far the most expensive operation on the disk drive. Effective write speed is determined by the rotational speed of the disk platter and the time needed to position the write head. Typical is around 30 megabytes/second for consumer-level drives, give or take a factor of 2.
Perhaps you see the fire-hose problem here. You are writing to the file cache a lot faster than it can be emptied. This does hit the wall eventually, you'll manage to fill the cache to capacity and suddenly see the perf of your program fall off a cliff. Your program must now wait until space opens up in the cache so the write can complete, effective write speed is now throttled by disk write speeds.
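The figures from the question's edit fit this picture: 600 x 2 MB, 300 x 4 MB and 150 x 8 MB all come to about 1200 MB written before the slowdown, which is what you would expect if a buffer of fixed size is filling up. Below is a minimal sketch of that arithmetic plus a crude fill-time model. The function names are made up for illustration, and the model ignores everything except a constant input rate and a constant drain rate.

```cpp
#include <cassert>

// Total data written before the slowdown, in MB.
double mb_before_slowdown(double package_mb, double packages) {
    return package_mb * packages;
}

// Seconds until a cache of `cache_mb` fills when data arrives at `in_mb_s`
// and drains to disk at `out_mb_s`. Returns -1 if the disk keeps up.
double seconds_until_full(double cache_mb, double in_mb_s, double out_mb_s) {
    assert(cache_mb >= 0.0);
    if (in_mb_s <= out_mb_s) return -1.0;
    return cache_mb / (in_mb_s - out_mb_s);
}
```

As an example under stated assumptions: with a hypothetical 1000 MB of cache, 200 MB/s coming in (one 4 MB package per 20 ms) and 30 MB/s draining to disk, seconds_until_full(1000.0, 200.0, 30.0) comes out at a bit under 6 seconds, after which writes can only proceed at disk speed.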
The 20 msec delays you observe are normal as well. That's typically how long it takes to open a file. That is a time that's completely dominated by disk head seek times, it needs to travel to the file system index to write the directory entry. Nominal times are between 20 and 50 msec, you are on the low end of that already.
Clearly there is very little you can do in your code to improve this. What CRT functions you use certainly don't make any difference, as you found out. At best you could increase the size of the files you write, that reduces the overhead spent on creating the file.
Buying more RAM is always a good idea. But it of course merely delays the moment where the firehose overflows the bucket. You need better drive hardware to get ahead. An SSD is pretty nice, so is a striped raid array. Best thing to do is to simply not wait for your program to complete :)
So the question is; why these file opening lines become problematic
after some time, and is there a way to make file writing operation
stable, I mean fixed time?
This observation (i.e. varying time taken by write operations) does not mean that there is a problem in the OS or file system. There could be various reasons behind it. One possible reason is that the kernel may use delayed writes to get the data to disk. Sometimes the kernel caches (buffers) the data in case another process reads or writes it soon, so that an extra disk operation can be avoided.
This can lead to inconsistent times for different write calls with the same size of data/buffer.
File I/O is a complicated topic that depends on various other factors. For complete information on the internal algorithms of a file system, you may want to refer to the classic book "The Design of the UNIX Operating System" by Maurice J. Bach, which describes these concepts and their implementation in detail.
Having said that, you may want to call flush immediately after your write call in both versions of your program (i.e. C and C++). This way you may get more consistent times for your file writes. Otherwise your program's behaviour looks correct to me.
//C program
fwrite(data, 1, size, fp); // data: pointer to the buffer, size: number of bytes
fflush(fp);
//C++ Program
fout.write(data, size);
fout.flush();
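To see whether flushing actually evens out the timings, each open/write/flush/close cycle can be timed on its own. A minimal sketch using portable stdio and a hypothetical helper name; note that fflush only hands the CRT buffer to the operating system, so the data may still be in the file system cache when the call returns.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>

// Opens `path`, writes `size` bytes, flushes the CRT buffer, closes the file,
// and returns the elapsed time in milliseconds (or -1 on failure).
double timed_write_ms(const char* path, const void* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    auto start = std::chrono::steady_clock::now();
    std::FILE* fp = std::fopen(path, "wb");
    if (!fp) return -1.0;
    bool ok = std::fwrite(data, 1, size, fp) == size;
    ok = (std::fflush(fp) == 0) && ok;   // push the CRT buffer to the OS
    ok = (std::fclose(fp) == 0) && ok;
    auto stop = std::chrono::steady_clock::now();
    if (!ok) return -1.0;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Logging the returned value for every package makes it easy to see at which package number the spikes start and whether they correlate with the amount of data written so far.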
It's possible that the spikes are not related to the I/O itself but to NTFS metadata: when your file count reaches some limit, some NTFS AVL-like data structure needs some refactoring and... bump!
To check this, preallocate the file entries, for example by creating all the files with zero size and then opening them when writing, just for testing. If my theory is correct, you shouldn't see your spikes anymore.
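A sketch of that experiment in portable C++ could look like this. make_file_name is a hypothetical stand-in for the question's getFileName(id), and the ".bin" naming is an assumption made only for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical stand-in for the question's getFileName(id).
std::string make_file_name(const std::string& prefix, std::size_t id) {
    return prefix + std::to_string(id) + ".bin";
}

// Creates `count` empty files up front; returns how many were created.
std::size_t precreate_files(const std::string& prefix, std::size_t count) {
    assert(!prefix.empty());
    std::size_t created = 0;
    for (std::size_t id = 0; id < count; ++id) {
        std::ofstream out(make_file_name(prefix, id),
                          std::ios::binary | std::ios::trunc);
        if (out) ++created;
    }
    return created;
}

// Removes the files again (handy when repeating the experiment).
void remove_files(const std::string& prefix, std::size_t count) {
    for (std::size_t id = 0; id < count; ++id) {
        std::remove(make_file_name(prefix, id).c_str());
    }
}
```

Run precreate_files once before the measurement starts, then let the timed loop open the existing files as before. If the spikes in the open calls disappear, the cost was in creating directory entries rather than in writing the data.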
UHH - and you must disable file indexing (Windows search service) there! Just remembered of it... see here.