HDD benchmark in C++ - measured transfer speed is too fast

I am trying to develop a mini benchmarking system in C++ and I am having trouble measuring the HDD read and write speed. More precisely, the transfer speed I measure is huge: 400-600 MB/s for reads and above 1000 MB/s for writes. I have a 5400 RPM hard disk drive (not an SSD); its real read/write speed (according to a benchmarking program) is roughly 60 MB/s.
//blockSize is 4096
//my data buffer
char* mydata = (char*)malloc(1*blockSize);
//initialized with random data
srand(time(NULL));
for(int i=0;i<blockSize;i++){
    mydata[i] = rand()%256;
}

double startt, endt, difft;
int times = 10*25000;
int i=0,j=0;
DWORD written;
HANDLE f, g;
DWORD read;

f=CreateFileA(
    "newfolder/myfile.txt",
    GENERIC_WRITE,
    0,
    NULL,
    CREATE_ALWAYS,
    FILE_ATTRIBUTE_NORMAL,
    NULL
);
if(f==INVALID_HANDLE_VALUE){
    std::cout<<"Error opening for write.";
    return -1;
}

startt = clock();
for(i=0;i<times;i++){
    WriteFile(
        f,
        mydata,
        blockSize,
        &written,
        NULL
    );
}
endt = clock();
difft = 1.0*(endt-startt)/(1.0*CLOCKS_PER_SEC);
std::cout<<"\nWrite time: "<<difft;
std::cout<<"\nWrite speed: "<<1.0*times*blockSize/difft/1024/1024<<" MB/s";
CloseHandle(f);

//------------------------------------------------------------------------------------------------

g=CreateFile("newfolder/myfile.txt",
    GENERIC_READ,
    0,
    NULL,
    OPEN_ALWAYS,
    FILE_ATTRIBUTE_NORMAL,
    NULL
);
if(g==INVALID_HANDLE_VALUE){
    std::cout<<"Error opening for read.";
    return -1;
}

startt = clock();
for(i=0;i<times;i++){
    ReadFile(
        g,
        mydata,
        blockSize,
        &read,
        NULL
    );
}
endt = clock();
difft = 1.0*(endt-startt)/(1.0*CLOCKS_PER_SEC);
std::cout<<"\nRead time: "<<difft;
std::cout<<"\nRead speed: "<<1.0*times*blockSize/difft/1024/1024<<" MB/s";
CloseHandle(g);
I tried using the fopen and fwrite functions too and got similar results.
I ran my app on another computer. The write speed was about right, but the read speed was still huge.
The most interesting thing is that the application actually creates a 1 GB file in about 2 seconds, which corresponds to a 500 MB/s write speed.
Does anybody have any idea what I am doing wrong?

Technically, you are doing nothing wrong. The problem is that every OS caches all I/O operations. The HDD itself also caches some data so it can perform the operations efficiently.
This question is very platform-specific. You would need to defeat the caching somehow.
Perhaps you should look at this library: Bonnie++. You may find it useful. It was written for Unix systems, but its source code could reveal some useful techniques.
On Windows, based on this resource, passing the additional flag FILE_FLAG_NO_BUFFERING to the CreateFile function should be enough to disable buffering for this file.
Quote:
In these situations, caching can be turned off. This is done at the time the file is opened by passing FILE_FLAG_NO_BUFFERING as a value for the dwFlagsAndAttributes parameter of CreateFile. When caching is disabled, all read and write operations directly access the physical disk. However, the file metadata may still be cached. To flush the metadata to disk, use the FlushFileBuffers function.
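For illustration, here is a minimal sketch of opening the file that way (not the original code: the VirtualAlloc call is my addition, since unbuffered I/O requires the buffer address and transfer size to be multiples of the sector size, which malloc does not guarantee):
#include <windows.h>
#include <iostream>

int main() {
    const DWORD blockSize = 4096; // assumed to match (or be a multiple of) the sector size
    // Page-aligned buffer, required for FILE_FLAG_NO_BUFFERING.
    char* mydata = static_cast<char*>(
        VirtualAlloc(NULL, blockSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));

    HANDLE f = CreateFileA("newfolder/myfile.txt",
                           GENERIC_WRITE,
                           0,
                           NULL,
                           CREATE_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL | FILE_FLAG_NO_BUFFERING,
                           NULL);
    if (f == INVALID_HANDLE_VALUE) {
        std::cout << "Error opening for write.";
        return -1;
    }

    DWORD written = 0;
    WriteFile(f, mydata, blockSize, &written, NULL); // goes to the physical disk
    FlushFileBuffers(f);                             // flush the still-cached metadata
    CloseHandle(f);
    VirtualFree(mydata, 0, MEM_RELEASE);
    return 0;
}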

You are measuring the performance of the cache.
Try storing a lot more data than that; once the cache fills, the data should be written straight to the disk.

I think I have figured it out.
Unbuffered write speed depends on the size of the block the WriteFile function is writing: my experiments show that the bigger the block, the higher the write speed. For large blocks (>1 MB) it even outperforms buffered writing, which I was able to confirm by writing more than 2 GB of data.
To summarize, one can measure the hard drive write speed accurately by:
Opening the file using CreateFile with the FILE_FLAG_NO_BUFFERING flag.
Writing a lot of data per WriteFile call.
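A sketch of that measurement, in case it helps someone (the 4 MB block size, the file name and the use of QueryPerformanceCounter/VirtualAlloc are my own choices, not taken from the original code):
#include <windows.h>
#include <iostream>

int main() {
    const DWORD blockSize = 4 * 1024 * 1024;  // 4 MB per WriteFile call
    const int blocks = 256;                   // 1 GB in total
    // Sector-aligned buffer, required by FILE_FLAG_NO_BUFFERING.
    char* buffer = static_cast<char*>(
        VirtualAlloc(NULL, blockSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));

    HANDLE f = CreateFileA("bench.bin", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL | FILE_FLAG_NO_BUFFERING, NULL);
    if (f == INVALID_HANDLE_VALUE) return -1;

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    DWORD written = 0;
    for (int i = 0; i < blocks; i++)
        WriteFile(f, buffer, blockSize, &written, NULL);
    QueryPerformanceCounter(&t1);

    double seconds = double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
    double megabytes = double(blocks) * blockSize / (1024.0 * 1024.0);
    std::cout << "Write speed: " << megabytes / seconds << " MB/s\n";

    CloseHandle(f);
    VirtualFree(buffer, 0, MEM_RELEASE);
    return 0;
}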

Related

WriteFile fails for > 4700 blocks (SD card raw write / Windows)

I am writing/reading raw data on an SD card. The code for writing works up to approximately 4700 blocks and fails after this limit. Here is the code:
//Data to be written
uint8_t* sessions;
sessions = (uint8_t *) malloc(2048*sizeof(uint8_t));
unsigned int i;
for(i=0;i<(2048*sizeof(uint8_t));i++) sessions[i]=8;

DWORD dwWrite;
HANDLE hDisk=CreateFileA("\\\\.\\K:",  // drive to open = SD CARD
    GENERIC_WRITE,                     // access to the drive
    FILE_SHARE_READ |                  // share mode
    FILE_SHARE_WRITE,
    NULL,                              // default security attributes
    OPEN_EXISTING,                     // disposition
    FILE_FLAG_NO_BUFFERING,            // file attributes
    NULL);                             // do not copy file attributes
if(hDisk==INVALID_HANDLE_VALUE)
{
    CloseHandle(hDisk);
    printf("ERROR opening the file !!! ");
}

DWORD dwPtr = SetFilePointer(hDisk,10000*512,0,FILE_BEGIN); //4700 OK
if (dwPtr == INVALID_SET_FILE_POINTER) // Test for failure
{
    printf("CANNOT move the file pointer !!! ");
}

//Try using this structure but same results: CAN BE IGNORED
OVERLAPPED osWrite = {0,0,0};
memset(&osWrite, 0, sizeof(osWrite));
osWrite.Offset = 10000*512; //4700 OK
osWrite.hEvent = CreateEvent(FALSE, FALSE, FALSE, FALSE);

if( FALSE == WriteFile(hDisk,sessions,2048,&dwWrite,&osWrite) ){
    printf("CANNOT write data to the SD card!!! %lu",dwWrite);
}else{
    printf("Written %lu on SD card",dwWrite);
}
CloseHandle(hDisk);
The issue is with the WriteFile function (windows.h). If the block number is less than 4700, everything is fine (the data is written to the SD card), but if the block number is, say, 5000 or 10000, the function fails ("Written 0").
Note that without FILE_FLAG_NO_BUFFERING there is no way to open the drive (SD card). The OVERLAPPED structure is a failed attempt to make it work; not using it (WriteFile(hDisk,sessions,2048,&dwWrite,NULL)) leads to the same behaviour. SetFilePointer also works for blocks higher than 4700. I have tested two different SD cards as well. I am on Windows 10.
Any hint as to what is happening?
Thank you for your input.
From the documentation for WriteFile:
A write on a volume handle will succeed if the volume does not have a mounted file system, or if one of the following conditions is true:
The sectors to be written to are boot sectors.
The sectors to be written to reside outside of file system space.
You have explicitly locked or dismounted the volume by using FSCTL_LOCK_VOLUME or FSCTL_DISMOUNT_VOLUME.
The volume has no actual file system. (In other words, it has a RAW file system mounted.)
You are able to write to the first couple of megabytes because (for historical reasons) the file system doesn't use that space. In order to write to the rest of the volume, you'll first have to lock the volume using the FSCTL_LOCK_VOLUME control code.
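A rough sketch of what that looks like (error handling trimmed; the drive letter is the one from the question, and the lock is released again at the end):
#include <windows.h>
#include <winioctl.h>
#include <cstdio>

int main() {
    HANDLE hDisk = CreateFileA("\\\\.\\K:",            // SD card volume
                               GENERIC_READ | GENERIC_WRITE,
                               FILE_SHARE_READ | FILE_SHARE_WRITE,
                               NULL, OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING, NULL);
    if (hDisk == INVALID_HANDLE_VALUE) return 1;

    DWORD bytesReturned = 0;
    // Lock the volume so writes beyond the file-system reserved area are allowed.
    if (!DeviceIoControl(hDisk, FSCTL_LOCK_VOLUME,
                         NULL, 0, NULL, 0, &bytesReturned, NULL)) {
        printf("FSCTL_LOCK_VOLUME failed: %lu\n", GetLastError());
        CloseHandle(hDisk);
        return 1;
    }

    // ... SetFilePointer / WriteFile as in the question ...

    DeviceIoControl(hDisk, FSCTL_UNLOCK_VOLUME, NULL, 0, NULL, 0, &bytesReturned, NULL);
    CloseHandle(hDisk);
    return 0;
}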
You should pass NULL as the third parameter of SetFilePointer, lpDistanceToMoveHigh, unless you are using the high-order 32 bits of a 64-bit offset. Also, if you are not using the OVERLAPPED structure, make sure to pass NULL to WriteFile for that parameter.
Also, make sure you are not overflowing any of the data types you are using, and be mindful of the addressing limitations of the system you are working on.
MSDN WriteFile
MSDN SetFilePointer
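As a hedged footnote to the SetFilePointer advice above: on a large card the byte offset can exceed 32 bits, so using SetFilePointerEx with a LARGE_INTEGER (or filling in lpDistanceToMoveHigh) avoids truncating the offset. The block number below is made up:
#include <windows.h>

// hDisk is an already opened volume handle; blockNumber/blockSize are illustrative.
bool SeekToBlock(HANDLE hDisk, long long blockNumber, long long blockSize = 512)
{
    LARGE_INTEGER offset;
    offset.QuadPart = blockNumber * blockSize;  // 64-bit arithmetic, no overflow
    return SetFilePointerEx(hDisk, offset, NULL, FILE_BEGIN) != 0;
}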

Why is my C++ disk write test much slower than a simple file copy using bash?

Using the program below, I try to test how fast I can write to disk using std::ofstream.
I achieve around 300 MiB/s when writing a 1 GiB file.
However, a simple file copy using the cp command is at least twice as fast.
Is my program hitting the hardware limit or can it be made faster?
#include <chrono>
#include <iostream>
#include <fstream>

char payload[1000 * 1000]; // 1 MB

void test(int MB)
{
    // Configure buffer
    char buffer[32 * 1000];
    std::ofstream of("test.file");
    of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));

    auto start_time = std::chrono::steady_clock::now();

    // Write a total of 1 GB
    for (auto i = 0; i != MB; ++i)
    {
        of.write(payload, sizeof(payload));
    }

    double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
    double megabytes_per_ns = 1e3 / elapsed_ns;
    double megabytes_per_s = 1e9 * megabytes_per_ns;
    std::cout << "Payload=" << MB << "MB Speed=" << megabytes_per_s << "MB/s" << std::endl;
}

int main()
{
    for (auto i = 1; i <= 10; ++i)
    {
        test(i * 100);
    }
}
Output:
Payload=100MB Speed=3792.06MB/s
Payload=200MB Speed=1790.41MB/s
Payload=300MB Speed=1204.66MB/s
Payload=400MB Speed=910.37MB/s
Payload=500MB Speed=722.704MB/s
Payload=600MB Speed=579.914MB/s
Payload=700MB Speed=499.281MB/s
Payload=800MB Speed=462.131MB/s
Payload=900MB Speed=411.414MB/s
Payload=1000MB Speed=364.613MB/s
Update
I changed from std::ofstream to fwrite:
#include <chrono>
#include <cstdio>
#include <iostream>

char payload[1024 * 1024]; // 1 MiB

void test(int number_of_megabytes)
{
    FILE* file = fopen("test.file", "w");

    auto start_time = std::chrono::steady_clock::now();

    // Write a total of 1 GB
    for (auto i = 0; i != number_of_megabytes; ++i)
    {
        fwrite(payload, 1, sizeof(payload), file);
    }
    fclose(file); // TODO: RAII

    double elapsed_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start_time).count();
    double megabytes_per_ns = 1e3 / elapsed_ns;
    double megabytes_per_s = 1e9 * megabytes_per_ns;
    std::cout << "Size=" << number_of_megabytes << "MiB Duration=" << long(0.5 + 100 * elapsed_ns/1e9)/100.0 << "s Speed=" << megabytes_per_s << "MiB/s" << std::endl;
}

int main()
{
    test(256);
    test(512);
    test(1024);
    test(1024);
}
Which improves the speed to 668MiB/s for a 1 GiB file:
Size=256MiB Duration=0.4s Speed=2524.66MiB/s
Size=512MiB Duration=0.79s Speed=1262.41MiB/s
Size=1024MiB Duration=1.5s Speed=664.521MiB/s
Size=1024MiB Duration=1.5s Speed=668.85MiB/s
Which is just as fast as dd:
time dd if=/dev/zero of=test.file bs=1024 count=0 seek=1048576
real 0m1.539s
user 0m0.001s
sys 0m0.344s
First, you're not really measuring the disk write speed, but (partly) the speed of writing data to the OS disk cache. To really measure the disk write speed, the data should be flushed to disk before stopping the timer. Without flushing, there could be a difference depending on the file size and the available memory.
There seems to be something wrong in the calculations too: you're not using the value of MB.
Also make sure the buffer size is a power of two, or at least a multiple of the disk page size (4096 bytes): char buffer[32 * 1024];. You might as well do that for payload too. (It looks like you changed that from 1024 to 1000 in the edit where you added the calculations.)
Do not use streams to write a (binary) buffer of data to disk; instead write directly to the file using FILE*, fopen(), fwrite(), fclose(). See this answer for an example and some timings.
To copy a file: open the source file in read-only (and, if possible, forward-only) mode, and use fread()/fwrite():
while fread() from source to buffer
fwrite() buffer to destination file
This should give you a speed comparable to that of an OS file copy (you might want to test different buffer sizes).
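A minimal sketch of that copy loop (the 1 MB buffer size is an arbitrary choice; tune it as suggested):
#include <cstdio>
#include <vector>

// Copies src to dst using plain C stdio; returns true on success.
bool copy_file(const char* src, const char* dst, size_t buffer_size = 1 << 20)
{
    FILE* in = fopen(src, "rb");
    FILE* out = fopen(dst, "wb");
    if (!in || !out) { if (in) fclose(in); if (out) fclose(out); return false; }

    std::vector<char> buffer(buffer_size);
    size_t n;
    while ((n = fread(buffer.data(), 1, buffer.size(), in)) > 0)
        fwrite(buffer.data(), 1, n, out);

    fclose(in);
    fclose(out);
    return true;
}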
This might be slightly faster using memory mapping:
open src, create memory mapping over the file
open/create dest, set file size to size of src, create memory mapping over the file
memcpy() src to dest
For large files smaller mapped views should be used.
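Roughly like this on Linux, as a sketch only (it maps the whole file in one view, so it is only suitable for files that fit comfortably in the address space; for large files use smaller views as noted above):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstring>

bool copy_file_mmap(const char* src, const char* dst)
{
    int in = open(src, O_RDONLY);
    if (in < 0) return false;
    struct stat st;
    fstat(in, &st);

    int out = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (out < 0 || ftruncate(out, st.st_size) != 0) {
        close(in);
        if (out >= 0) close(out);
        return false;
    }

    void* s = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, in, 0);
    void* d = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, out, 0);
    if (s != MAP_FAILED && d != MAP_FAILED)
        memcpy(d, s, st.st_size);               // the actual copy

    if (s != MAP_FAILED) munmap(s, st.st_size);
    if (d != MAP_FAILED) munmap(d, st.st_size);
    close(in);
    close(out);
    return true;
}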
Streams are slow.
cp uses syscalls directly: read(2) or mmap(2).
I'd wager that it's something clever inside either cp or the filesystem. If it's inside cp, it might be that the file you are copying has a lot of zeros in it, and cp detects this and writes a sparse version of your file. The man page for cp says "By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well." That could mean a few things, but one of them is that cp may create a sparse copy of your file, which requires less disk write time.
If it's within your filesystem, it might be deduplication.
As a long shot, it might also be something within your OS or your disk firmware that translates the reads and writes into specialized instructions that don't require as much synchronization as your program does (lower bus use means less latency).
You're using a relatively small buffer size. Small buffers mean more operations per second, which increases overhead. Disk systems have a small amount of latency before they receive the read/write request and begin processing it; a larger buffer amortizes that cost a little better. A smaller buffer may also mean that the disk is spending more time seeking.
You're not issuing multiple simultaneous requests - you require one read to finish before the next starts. This means that the disk may have dead time where it is doing nothing. Since all writes depend on all reads, and your reads are serial, you're starving the disk system of read requests (doubly so, since writes will take away from reads).
The total of requested read bytes across all read requests should be larger than the bandwidth-delay product of the disk system. If the disk has 0.5 ms delay and a 4 GB/sec performance, then you want to have 4 GB * 0.5 ms = 2 MB worth of reads outstanding at all times.
You're not using any of the operating system's hints that you're doing sequential reading.
To fix this:
Change your code to have more than one outstanding read request at all times.
Have enough read requests outstanding such that you're waiting on at least 2 MBs worth of data.
Use the posix_fadvise() flags to help the OS disk scheduler and page cache optimize (see the sketch at the end of this answer).
Consider using mmap to cut down on overhead.
Use a larger buffer size per read request to cut down on overhead.
This answer has more information:
https://stackoverflow.com/a/3756466/344638
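For the posix_fadvise() point, a small sketch (the file name and read loop are placeholders):
#include <fcntl.h>
#include <unistd.h>

// Tell the kernel the whole file will be read sequentially,
// so it can schedule larger readahead.
void advise_sequential(int fd)
{
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // offset 0, length 0 = entire file
}

// Hypothetical usage:
//   int fd = open("test.file", O_RDONLY);
//   advise_sequential(fd);
//   ... read(fd, ...) in large blocks ...
//   close(fd);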
The problem is that you specify too small a buffer for your fstream:
char buffer[32 * 1000];
std::ofstream of("test.file");
of.rdbuf()->pubsetbuf(buffer, sizeof(buffer));
Your app runs in user mode. To write to disk, ofstream calls the system write function, which executes in kernel mode. The write then transfers the data to the system cache, then to the HDD cache, and only then is it written to the disk.
This buffer size affects the number of system calls (one call for every 32*1000 bytes). During a system call the OS must switch the execution context from user mode to kernel mode and back again. Switching context is overhead; on Linux it is equivalent to roughly 2500-3500 simple CPU instructions. Because of that, your app spends most of its CPU time in context switching.
In your second app you use
FILE* file = fopen("test.file", "w");
FILE uses a bigger buffer by default, which is why it produces more efficient code. You can try to specify a small buffer with setvbuf; in that case you should see the same performance degradation.
Please note that in your case the bottleneck is not HDD performance; it is context switching.
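To illustrate the suggested experiment, a sketch that shrinks the stdio buffer with setvbuf to roughly the streambuf size used above (sizes are illustrative); with this, fwrite should degrade in the same way the ofstream version did:
#include <cstdio>

int main()
{
    static char payload[1024 * 1024];   // 1 MiB payload
    static char small_buf[32 * 1000];   // deliberately small stdio buffer

    FILE* file = fopen("test.file", "w");
    if (!file) return 1;

    // Must be called before any I/O is done on the stream.
    setvbuf(file, small_buf, _IOFBF, sizeof(small_buf));

    for (int i = 0; i != 1024; ++i)     // 1 GiB in total
        fwrite(payload, 1, sizeof(payload), file);

    fclose(file);
    return 0;
}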

Windows shared memory access time slow

I am currently using shared memory with two mapped files (1.9 GB for the first one and 600 MB for the second) in a piece of software.
I am using a process that reads data from the first file, processes it and writes the results to the second file.
I have noticed that sometimes there is a strong delay (for a reason beyond my knowledge) when reading from or writing to the mapped view with memcpy.
Mapped files are created this way :
m_hFile = ::CreateFileW(SensorFileName,
                        GENERIC_READ | GENERIC_WRITE,
                        0,
                        NULL,
                        CREATE_ALWAYS,
                        FILE_ATTRIBUTE_NORMAL,
                        NULL);
m_hMappedFile = CreateFileMapping(m_hFile,
                                  NULL,
                                  PAGE_READWRITE,
                                  dwFileMapSizeHigh,
                                  dwFileMapSizeLow,
                                  NULL);
And memory mapping is done this way:
m_lpMapView = MapViewOfFile(m_hMappedFile,
                            FILE_MAP_ALL_ACCESS,
                            dwOffsetHigh,
                            dwOffsetLow,
                            m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow values are aligned to the allocation granularity from the system info.
The process reads about 300 KB at a time, N times, storing the data in a buffer, processes it, and then writes the processed contents of the previous buffer to the second file, again about 300 KB at a time, N times.
I have two different memory views (created/moved with the MapViewOfFile function) with a default size of 10 MB.
For the memory view size I tested 10 KB, 100 KB, 1 MB, 10 MB and 100 MB. Statistically there is no difference; 80% of the time the reading process behaves as described below (~200 ms), but the writing process is really slow.
Normally:
1/ Reading is done in ~200 ms.
2/ Processing is done in 2.9 seconds.
3/ Writing is done in ~200 ms.
I can see that 80% of the time, either reading or writing (in the worst case both are slow) will take between 2 and 10 seconds.
Example: for writing, I am using the code below.
for (unsigned int i = 0 ; i < N ; i++) // N = 500~3k
{
    // Check the position of the memory view for ponderation
    if (###)
        MoveView(iOffset);

    if (m_lpMapView)
    {
        memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
        // uiSize = ~300 kBytes
        memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
    }
    else
        return uiANNStatus;
}
After using the GetTickCount function to pinpoint where the delay is, I see that the second memcpy call is always the one taking most of the time.
So, so far I am seeing N (for the test, I used N = 500) calls to memcpy taking 10 seconds in the worst case when using these shared memory views.
I made a temporary program that performed the same number of memcpy calls on the same amount of data and could not reproduce the problem.
For the tests, I used the following conditions, and they all show the same delay:
1/ I can see this on various computers, 32- or 64-bit, from Windows 7 to Windows 10.
2/ Using the main thread or multiple threads (up to 8, with critical sections for synchronization) for reading/writing.
3/ OS on SATA or SSD; the software's memory-mapped files physically on a SATA or SSD hard disk; and if on an external hard disk, tests were done through USB 1, USB 2 or USB 3.
I am kindly asking what you think my mistake is that makes memcpy so slow.
Best regards.
I found a solution that works for me, but it might not work for others.
Following Thomas Matthews' comments, I checked MSDN and found two interesting functions, FlushViewOfFile and FlushFileBuffers (but couldn't find anything interesting about locking memory).
Calling both after the for loop forces the mapped file to be updated.
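Roughly like this (the handle and view names are the ones from the question; a length of zero flushes the whole view):
// after the for loop:
FlushViewOfFile(m_lpMapView, 0);   // write the view's dirty pages to the file
FlushFileBuffers(m_hFile);         // then flush the file's buffers/metadata to disk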
I no longer have the "random" delay, but instead of the expected 200 ms I have an average of 400 ms, which is good enough for my application.
After doing some tests I saw that calling them too often causes heavy hard-disk access and makes the delay worse (10 seconds for every for loop), so the flushes should be used carefully.
Thanks.

win32 I/O performance issues

I have a problem with win32 I/O performance:
I'm trying to achieve a decent writing speed using OpenFile/WriteFile.
Using Resource Monitor (it comes with Windows) I measured the writing speed of the following piece of code and found that it is writing at 2 MB/sec...
HANDLE hFile = INVALID_HANDLE_VALUE;
hFile = CreateFile(
    L"test",
    (GENERIC_READ | GENERIC_WRITE),
    FILE_SHARE_READ,
    NULL,
    OPEN_ALWAYS,
    (FILE_ATTRIBUTE_NORMAL |
     FILE_FLAG_WRITE_THROUGH |
     FILE_FLAG_NO_BUFFERING),
    NULL);

if (hFile != INVALID_HANDLE_VALUE)
{
    //OK
    unsigned long bytesWritten = 0;
    unsigned long* Buffer = (unsigned long*)malloc(4096*sizeof(unsigned long));
    ZeroMemory(Buffer, 4096); //thanks to 'bash.d'

    while (true)
    {
        /* the infinite loop is intentional
           because I wanted to see if the writing speed of 2MB/sec
           was right */
        WriteFile(hFile,
                  Buffer,
                  4096,
                  &bytesWritten,
                  NULL);
        if (bytesWritten <= 0)
        {
            break;
        }
    }
}
I tried with the following and it's the same...
hFile = CreateFile(
    L"test",
    (GENERIC_READ | GENERIC_WRITE),
    FILE_SHARE_READ,
    NULL,
    OPEN_ALWAYS,
    FILE_ATTRIBUTE_NORMAL,
    NULL);
What am I doing wrong (with regard to the writing speed), and how can I improve it?
Thank you, and sorry for my English.
Edit:
I'm writing to a local disk.
This is very interesting, and similar to an issue I have, and can reproduce on 2 different servers with Windows Server 2003 SP2 64-bit (single hard drives, not RAID). Simply doing a WriteFile() of 36 bytes and then 99964 bytes in a loop produces similar behavior (I'm guessing it would be the same with a single write, and some other versions of Windows; that's just what I happened to be using). CPU usage starts off very low, and then increases gradually -- on one server, the test was around 50% CPU usage at around 175GB (about 95% of that is kernel time; 60% in my program and 40% in 'System').
You may also try async I/O to test the performance. That means opening the file with FILE_FLAG_OVERLAPPED and using the lpOverlapped argument of WriteFile. You may or may not get better performance with FILE_FLAG_NO_BUFFERING; you will have to test to see.
FILE_FLAG_NO_BUFFERING will generally give you more consistent speeds and better streaming behavior, and it avoids polluting your disk cache with data that you may not need again, but it isn't necessarily faster overall.
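A minimal sketch of a single overlapped write (a real benchmark would keep several writes in flight at different offsets; buffer, blockSize and the file name are placeholders):
HANDLE h = CreateFileW(L"test", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                       FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED, NULL);

OVERLAPPED ov = {};
ov.Offset = 0;                                    // byte offset of this write
ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

if (!WriteFile(h, buffer, blockSize, NULL, &ov) &&
    GetLastError() != ERROR_IO_PENDING)
{
    // real failure
}
// ... queue more writes at other offsets here ...

DWORD written = 0;
GetOverlappedResult(h, &ov, &written, TRUE);      // wait for this write to finish
CloseHandle(ov.hEvent);
CloseHandle(h);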
You should also test to see what the best size is for each block of I/O. In my experience there is a huge performance difference between copying a file 4 KB at a time and copying it 1 MB at a time.
In my past testing of this (a few years ago) I found that block sizes below about 64 KB were dominated by overhead, and total throughput continued to improve with larger block sizes up to about 512 KB. I wouldn't be surprised if with today's drives you needed to use block sizes larger than 1 MB to get maximum throughput.
The numbers you are currently using appear to be reasonable, but may not be optimal. Also, I'm fairly certain that FILE_FLAG_WRITE_THROUGH prevents the use of the on-disk cache and thus will cost you a fair bit of performance.
It would be worth trying the following:
1) Enable the FILE_FLAG_SEQUENTIAL_SCAN flag.
2) Enable "advanced performance" in "Disk Policies" in the Device Manager.
3) Vary the disk chunk size from 64 KB down to 4096 bytes.
4) Try FILE_FLAG_NO_BUFFERING.
Use async IO bound to a completion port
Pre-grow the file using SetFileValidData
Open the handle with FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH
A consumer-grade drive (even 5400 RPM) should be able to write ~130 MB/sec (single spindle, no RAID), provided no other I/O occurs at the same time (no head movement).
See https://github.com/rusanu/writing-a-binary-file-in-c-very-fast for an example.
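As a hedged sketch of the pre-grow step (SetFileValidData requires the SE_MANAGE_VOLUME_NAME privilege, so it only works from a suitably privileged process; the 1 GB size is arbitrary):
HANDLE h = CreateFileW(L"test", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);

LARGE_INTEGER size;
size.QuadPart = 1024LL * 1024 * 1024;   // pre-allocate 1 GB
SetFilePointerEx(h, size, NULL, FILE_BEGIN);
SetEndOfFile(h);                        // allocate the space
SetFileValidData(h, size.QuadPart);     // skip zero-filling on first write

LARGE_INTEGER zero = {};
SetFilePointerEx(h, zero, NULL, FILE_BEGIN);  // rewind before writing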

How to optimize time for writing a lot files (saving frames from video)

Background
I am currently working on a small application that grabs the RGB and depth map streams from a Microsoft Kinect device and saves them on disk for future analysis. When I run the program, it should output each frame as a separate image on disk.
The framerate of the Kinect is 30 fps, but there are two sources, which makes this (approximately) 60 fps. If I naively try to just save each frame when it arrives, I get dropped frames, as demonstrated by the bundled freenect/record.c application.
I rewrote the application to use one thread that grabs frames from the device and pushes them to the back of a double-ended queue (std::deque). Two other threads then pop frames from the front of the deque and save them to disk.
When the recording is turned off, there is a potentially large number of frames left in the queue that still need to be written, so before exiting we let the two save threads finish their work.
Now the actual problem
Although the problem of dropped frames is solved, writes to the filesystem are still quite slow. Is there any good way to speed up the file creation on disk?
Currently, the function dump_frame looks like this:
static void
dump_frame(struct frame* frame)
{
    FILE* fp;
    char filename[512]; /* plenty of space! */

    sprintf(filename, "d-%f-%u.pgm", get_time(), frame->timestamp);

    fp = fopen(filename, "w");
    fprintf(fp, "P5 %d %d 65535\n", frame->width, frame->height);
    fwrite(frame->data, frame->size, 1, fp);
    fclose(fp);
}
I am running Fedora 14 x64, so the solution only has to work with Linux as the operating system.
You need to measure what takes time in your specific case. Is it creating multiple files or actually writing the image data to disk?
When I tested on my local system with OS X and an Intel SSD X25M 2G, I noticed a huge variation in write times between writing multiple 1 MB files and writing one multi-MB file. This is probably due to housekeeping in the filesystem and will vary depending on the filesystem you use.
To avoid the housekeeping you could write all your images into the same file and split it afterwards. However, the data you are saving requires about 60 MB/s of sustained write speed, which is quite high.
An alternative, if you have a lot of memory, is to create a RAM disk, store the images there first and move them to the persistent filesystem later. With a 6 GB RAM disk you could store about 100 seconds of video.
A possible improvement would be to explicitly set the buffering of fp to fully buffered using setvbuf:
const size_t BUFFER_SIZE = 1024 * 16;
fp = fopen(filename, "w");
setvbuf(fp, 0, _IOFBF, BUFFER_SIZE); /* Must be immediately after the open. */
fprintf(fp, "P5 %d %d 65535\n", frame->width, frame->height);
fwrite(frame->data, frame->size, 1, fp);
fclose(fp);
You could profile using different buffer sizes to determine which provides the best performance.