Why do file IO speeds change when reading file size? - c++

I'm creating a disk-usage analysis program in C++ on Windows 10, running against a 7200 RPM HDD. It scans your drive and reports which folders are using how much space, so you can figure out where most of your storage is going.
For efficiency, I scan the entire drive recursively in one thread and read each file's size in another thread, so scanning and size analysis happen at the same time. For obvious reasons, scanning is much faster than reporting sizes, but I've noticed that the IO speed jumps around a lot: sometimes it reads the sizes of 5000 files per second, other times only 10 files per second. Take a look at the video at this link. The first number is how many files' sizes have been read and the second is how many files have been found altogether; the first number is what's important here.
Why does my file IO speed change, and is there anything I can do about this?
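For reference, here is a minimal sketch of the two-thread approach described above (one thread scanning, one sizing), assuming C++17 std::filesystem; the queue, the counters, and the root-path handling are illustrative rather than taken from the actual program:

    #include <condition_variable>
    #include <cstdint>
    #include <filesystem>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    namespace fs = std::filesystem;

    int main(int argc, char* argv[]) {
        const fs::path scan_root = (argc > 1) ? fs::path(argv[1]) : fs::path(".");

        std::queue<fs::path> pending;        // files found but not yet sized
        std::mutex m;
        std::condition_variable cv;
        bool scanning = true;                // protected by m

        std::uint64_t files_found = 0;
        std::uint64_t files_sized = 0;
        std::uint64_t total_bytes = 0;

        // Scanner thread: walks the tree and hands file paths to the sizing loop below.
        std::thread scanner([&] {
            std::error_code ec;
            for (fs::recursive_directory_iterator
                     it(scan_root, fs::directory_options::skip_permission_denied, ec), end;
                 it != end; it.increment(ec)) {
                if (ec) { ec.clear(); continue; }
                if (!it->is_regular_file(ec)) continue;
                {
                    std::lock_guard<std::mutex> lock(m);
                    pending.push(it->path());
                    ++files_found;
                }
                cv.notify_one();
            }
            { std::lock_guard<std::mutex> lock(m); scanning = false; }
            cv.notify_one();
        });

        // Sizing loop (the second stream of work): pop paths and query their sizes.
        for (;;) {
            fs::path p;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return !pending.empty() || !scanning; });
                if (pending.empty()) break;          // scanner finished and queue drained
                p = std::move(pending.front());
                pending.pop();
            }
            std::error_code ec;
            const std::uint64_t size = fs::file_size(p, ec);
            if (!ec) { total_bytes += size; ++files_sized; }
        }

        scanner.join();
        std::cout << files_sized << " of " << files_found << " files sized, "
                  << total_bytes << " bytes total\n";
    }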

You have many bottlenecks to consider, both on the processor side and on the hard drive side.
Locating The Data
Essentially, the hard drive has to locate the sectors and tracks that contain the file's data. If you are really lucky, the data will be in sequential sectors on sequential tracks, causing very little head movement or repositioning. However, file data can be "scattered", in which case the drive reads as much as it can, calculates the next position of the data, moves the head to that position, and keeps reading. This interrupts the flow of data. If your drive is intelligent and has a lot of cache, it can place this data into the cache and deliver it from there instead of from the platters, possibly making up some of the milliseconds lost to repositioning.
The Data Bus
The data has to go into the PC's memory. Usually there is only one bus for the data, and it is shared among many entities in your system, the processor and the hard-drive controller to name a few. If you're lucky, your PC has a Direct Memory Access (DMA) controller for the hard drive, which can transfer data from the drive into memory while bypassing the processor. However, the DMA controller must still share the data bus with the processor (and friends), and that bus arbitration is another source of slowdown and inconsistency.
Sharing the Drive
Many operating systems use the hard drive as virtual memory, swapping blocks of memory out to disk. Those paging requests have to be intermingled with the requests from your program.
Sequential Access
Most cheaper platforms have sequential access to the drive: only one entity can read at a time, and most drives deliver a single bit stream. Higher-performance, custom platforms actually have more than one drive running in parallel. Because of the sequential nature of the device, entities must either wait for one another or interleave their transactions. Compare this to memory, which is accessed in parallel (8 or more bits read at the same time).
Interruptions & Scheduling
There are lots of activities going on inside your PC, from internet or wifi communications to audio and video playback, as well as other system tasks. These all need to run, and no matter how many cores you have, there are never enough. Most operating systems schedule tasks by time and priority; very rarely will one task have exclusive ownership of a processor until it finishes. Your task will be intermingled with the other running tasks, which slows your program down.
Chunking It
Most disk clean-up utilities work in chunks or pieces of files. Speed is not as important as the quality of the data operation: for example, a smaller chunk of a file has a better chance of being moved or copied successfully than a huge chunk. The program can also be interrupted (by the user, for example), and smaller chunks allow for easier recovery from an interruption.
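As an illustration only, a copy loop along these lines processes one fixed-size chunk per pass and can stop cleanly at a chunk boundary; the 1 MiB chunk size and the cancellation flag are illustrative, not taken from any particular utility:

    #include <atomic>
    #include <fstream>
    #include <vector>

    // Copies src to dst in fixed-size chunks; returns false if cancelled midway,
    // so the caller can resume or clean up from a known chunk boundary.
    bool copy_in_chunks(const char* src, const char* dst, const std::atomic<bool>& cancel) {
        constexpr std::size_t kChunk = 1 << 20;              // 1 MiB per transaction
        std::ifstream in(src, std::ios::binary);
        std::ofstream out(dst, std::ios::binary);
        if (!in || !out) return false;

        std::vector<char> buf(kChunk);
        while (in) {
            if (cancel.load()) return false;                 // stop at a chunk boundary
            in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
            out.write(buf.data(), in.gcount());              // write only what was actually read
        }
        return static_cast<bool>(out);
    }

    int main() {
        std::atomic<bool> cancel{false};                     // a UI or worker thread could set this
        return copy_in_chunks("big_input.bin", "big_output.bin", cancel) ? 0 : 1;
    }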
There are probably more reasons why your program is executing slowly or has inconsistent timings, but the above information should give you better insight as to the behavior of your PC.

Related

What does "Disk Profiling" mean (related to hard disks)? [duplicate]

Currently I am working on a MFC application which reads and writes in to the disk. Sometimes this application runs amazingly fast and sometimes it is damn slow. I am guessing that it is because of the disk access involved, hence I want to profile it. These are some questions in this regard:
(1).Currently I am using AQTime profiler to profile the application. Has anybody tried profiling disk access using this? or is there any other tool available which I can use?
(2). What are the most important disk parameters I should be looking at?
(3). If I have multiple threads trying to read and write the data from disk does it affect the performance? i.e. am I better off having a single threaded access to the disk?
You can use the Windows Performance Toolkit for this. You can enable trace providers for disk I/O events and see the I/O time and disk service time for each. It does have a bit of a learning curve though. This will also let you determine which file I/O's actually result in real-access to the disk and aren't handled by the cache manager.
Most important parameters are disk service time and queue length. Disk service time is how long the disk actually took to service the request. Queue length indicates if your disk request is backed up behind other requests.
For many threads w/ reads & writes - Many disks have poor performance in the face of reads with background writes. If you have various threads doing lots of disk I/O to random locations on the disk, you may wind up starving certain requests.
To help you with (2):
Try to batch up your writes to disk to avoid many small calls to write. When you're done flushing your buffer, call commit. commit (aka fsync) is an expensive operation, so becomes even more so when there are lots of small writes.
On windows file handles you can experiment with FILE FLAG WRITE THROUGH to increase write speeds. Supposedly commit doesn't have to be called with handles using this flag.
If data you are writing to disk will also be accessed through reading, consider writing to an in memory structure first, having another thread read from the structure to write it to disk. This will help avoid calls to read data from disk that you have just written.
Hopefully this helps....
What I would do is, if you can't pause all threads at the same time and examine their state, focus on one of them and pause that, while it's being "damn slow". This is a little known but effective technique.
Since it is being extremely slow compared to what it could be, whatever it is waiting for it is waiting for probably 99% of the time, so when you pause it you will see it. That's true whether it's one big wait, or a zillion little ones. Look at the whole call stack. The culprit may be somewhere in the middle of the stack.
If you're not sure, pause it two or three times. The culprit will be on all stack samples.

Reading a file in C++

I am writing an application to monitor a file and then match some pattern in that file.
I want to know what the fastest way to read a file in C++ is.
Is reading line by line faster, or is reading a chunk of the file at a time faster?
Your question is more about the performance of hardware, operating systems, and runtime libraries than about programming languages. When you start reading a file, the OS probably loads it in chunks anyway; since the file is stored that way on disk, it makes sense for the OS to load each chunk entirely on first access and cache it, rather than reading the chunk, extracting the requested data, and discarding the rest.
Which is faster, line by line or a chunk at a time? As always with these things, the answer is not something you can predict; the only way to know for sure is to write a line-by-line version and a chunk-at-a-time version and profile them (measure how long each takes).
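A minimal sketch of that measurement, assuming a test file named input.txt (an illustrative name); both versions just count newlines so they do comparable work. Run each variant against the same cache state (cold or consistently warm), or the OS file cache will skew the comparison:

    #include <chrono>
    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // Runs f() once and returns the elapsed time in seconds.
    template <class F>
    double time_it(F f) {
        const auto t0 = std::chrono::steady_clock::now();
        f();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        std::size_t lines1 = 0, lines2 = 0;

        const double line_by_line = time_it([&] {
            std::ifstream in("input.txt");
            std::string line;
            while (std::getline(in, line)) ++lines1;       // "parse" = count lines
        });

        const double chunked = time_it([&] {
            std::ifstream in("input.txt", std::ios::binary);
            std::vector<char> buf(1 << 20);                // 1 MiB per read
            while (in) {
                in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
                for (std::streamsize i = 0; i < in.gcount(); ++i)
                    if (buf[i] == '\n') ++lines2;          // parse the buffer in memory
            }
        });

        std::cout << "line-by-line: " << line_by_line << " s (" << lines1 << " lines)\n"
                  << "chunked:      " << chunked << " s (" << lines2 << " lines)\n";
    }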
In general, reading large amounts of a file into a buffer, then parsing the buffer is a lot faster than reading individual lines. The actual proof is to profile code that reads line by line, then profile code reading in large buffers. Compare the profiles.
The foundation for this justification is:
Reduction of I/O Transactions
Keeping the Hard Drive Spinning
Parsing Memory Is Faster
I improved the performance of one application from 65 minutes down to 2 minutes by applying these techniques.
Reduction of I/O Transactions
Reducing the number of I/O transactions means fewer calls to the operating system, which reduces the time spent there. It also reduces the number of branches in your code, improving the performance of your processor's instruction pipeline, and it reduces traffic to the hard drive: with fewer commands to process, the drive has less overhead.
Keeping the Hard Drive Spinning
To access a file, the hard drive has to ramp up the motors to a decent speed (which takes time), position the head to the desired track and sector, and read the data. Positioning the head and ramping up the motor is overhead time required by all transactions. The overhead in reading the data is very little. The objective is to read as much data as possible in one transaction because this is where the hard drive is most efficient. Reducing the number of transactions will reduce the wait times for ramping up the motors and positioning the heads.
Although modern computers have caches for both data and commands, reducing the number of transactions speeds things up. Larger "payloads" allow more efficient use of those caches and avoid the overhead of sorting many small requests.
Parsing Memory Is Faster
Reading from memory is always faster than reading from an external source. Reading the next line of text from a buffer requires little more than incrementing a pointer; reading the next line from a file requires an I/O transaction to get the data into memory. If your program has memory to spare, haul the data into memory and then search the memory.
Too Much Data Negates The Performance Savings
There is a finite amount of RAM on the computer for applications to share. Using more memory than is available may cause the computer to "page", i.e. forward the request to the hard drive (also known as virtual memory). In that case there may be little savings, because the hard drive is accessed anyway (by the operating system, without your program's knowledge). Profiling will give you a good indication of the optimum size of the data buffer.
The application I optimized was reading one byte at a time from a 2 GB file. The performance greatly improved when I changed the program to read 1 MB chunks of data. This also allowed additional performance gains from loop unrolling.
Hope this helps.
You could try to map the file directly to memory using a memory-mapped-file, and then use standard C++ logic to find the patterns that you want.
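For example, a hedged Win32 sketch of that suggestion might map the file read-only and search it with std::search; the file name and the pattern are illustrative:

    #include <windows.h>
    #include <algorithm>
    #include <iostream>
    #include <string>

    int main() {
        HANDLE file = CreateFileA("input.txt", GENERIC_READ, FILE_SHARE_READ,
                                  nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        LARGE_INTEGER size{};
        GetFileSizeEx(file, &size);

        HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        if (mapping != nullptr) {
            const char* data = static_cast<const char*>(
                MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));   // map the whole file
            if (data != nullptr) {
                const std::string pattern = "ERROR";               // pattern to look for
                const char* end = data + size.QuadPart;
                const char* hit = std::search(data, end, pattern.begin(), pattern.end());
                if (hit != end)
                    std::cout << "pattern found at offset " << (hit - data) << "\n";
                UnmapViewOfFile(data);
            }
            CloseHandle(mapping);
        }
        CloseHandle(file);
        return 0;
    }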
The OS (or even the C++ class you use) probably reads the file in chunks and caches it even if you read it line by line, to minimize disk access (from the operating system's point of view, it is faster to serve data from a memory buffer than from the hard disk).
Note that a good way to improve the performance of your program (if it is really time-critical) is to minimize the number of calls to operating system functions (which manage the system's resources).

Performance issues with hard disk reading

I have a C++ program which reads files from the hard disk and does some processing on the data in the files. I am using standard Win32 APIs to read the files. My problem is that this program is blazingly fast sometimes and then suddenly slows down to 1/6th of the previous speed. If I read the same files again and again over multiple runs, normally the first run is the slowest; the program then maintains its speed until I read some other set of files. So my obvious guess was to profile the disk access time. I used the perfmon utility and measured IO Read Bytes/sec for my program, and as expected there was a huge difference (~5 times) in the number of bytes read. My questions are:
(1). Does OS (Windows in my case) cache the recently read files somewhere so that the subsequent loads are faster?
(2). If I can guarantee that all the files I read reside in the same directory, is there any way I can place them on the hard disk so that my disk access time is faster?
Is there anything I can do for this?
1) Windows does cache recently read files in memory. The book Windows Internals includes an excellent description of how this works. Modern versions of Windows also use a technology called SuperFetch, which tries to preemptively fetch disk contents into memory based on usage history, and ReadyBoost, which can cache to a flash drive for faster random access. All of these will increase the speed with which data is accessed from disk after the initial run.
2) Directory really doesn't affect layout on disk. Defragmenting your drive will group file data together. Windows Vista on up will automatically defragment your disk. Ideally, you want to do large sequential reads and minimize your writes. Small random accesses and interleaving writes with reads significantly hurts performance. You can use the Windows Performance Toolkit to profile your disk access.
Your numbered questions seem to be answered already. If you're still wondering what you can do to improve hard drive read speed, here are some tips:
Read with the OS functions (e.g., ReadFile) rather than wrapper libraries (like iostreams or stdio) if possible. Many wrappers introduce more levels of buffering.
Read sequentially, and let Windows know you're going to read sequentially with the FILE_FLAG_SEQUENTIAL_SCAN flag.
If you're only going to read (and not write), be sure to open the file just for reading.
Read in chunks, not bytes or characters.
Ideally the chunks should be multiples of the disk's cluster size.
Read from the disk at cluster-aligned offsets.
Read into memory at page boundaries. (If you're allocating a big chunk, it's probably page-aligned.)
Advanced: if you can start your computation after reading just the beginning of the file, you can use overlapped I/O to try to parallelize the computation and the subsequent reads as much as possible (see the sketch below).
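A hedged sketch tying several of these tips together, assuming plain Win32 calls: a read-only open with FILE_FLAG_SEQUENTIAL_SCAN and large chunked reads through ReadFile. The file name and the 1 MB chunk size (a multiple of typical cluster sizes) are illustrative choices:

    #include <windows.h>
    #include <vector>

    int main() {
        HANDLE h = CreateFileA("data.bin",
                               GENERIC_READ,                    // read-only open
                               FILE_SHARE_READ,
                               nullptr,
                               OPEN_EXISTING,
                               FILE_FLAG_SEQUENTIAL_SCAN,       // hint: sequential access
                               nullptr);
        if (h == INVALID_HANDLE_VALUE) return 1;

        std::vector<char> buf(1 << 20);                         // 1 MB chunks, not bytes
        unsigned long long total = 0;
        DWORD got = 0;
        while (ReadFile(h, buf.data(), static_cast<DWORD>(buf.size()), &got, nullptr) && got > 0) {
            total += got;                                       // process buf[0..got) here
        }
        CloseHandle(h);
        return total > 0 ? 0 : 1;
    }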
Yes, Windows (and most modern OS's) keep recently read file data in otherwise unused RAM so that if that file data is requested again in the near future it will already be available in RAM and disk access can be avoided.
As far as making disk access faster, you could try defragmenting your drive, but I wouldn't expect it to help too much. Drive access is just slow compared to RAM access, which is why RAM caching provides such a nice speedup.
As a diagnostic test, can you accurately measure the time it takes to load the very first time?
Then take that to determine the transfer rate. Then you can take that transfer rate and compare that to what you get when running HD Tune. For what it's worth, I ran this myself and got 44.2 MB/s minimum, 87 MB/s average, 110 MB/s max read speeds with my Western Digital RE3 drive (one of the faster 7200 RPM SATA drives available).
The point of all this is to see if your own application is doing the best it can. In other words, aside from caching you can't really read the files any faster than what your hard drive is capable of. So if you're reaching that limit then there's nothing more to do.
Also, make sure that you are not running out of memory during your tests. Run perfmon and monitor Memory > Available Bytes and PhysicalDisk > Disk Read Bytes/sec for the physical drive you are reading. Monitoring the process's I/O is a good idea too; keep in mind that the latter combines all I/O (network included).
You should expect about 50 MB/s for sequential reads from a single average SATA drive. A couple of good striped serial-SCSI drives will give you about 220 MB/s. If you see available memory going to near zero, that would be your problem. If it stays flat after the first round of reading, then it has something to do with your app.
A Microsoft utility called contig can be used to defragment a single file on disk or to create a new unfragmented file.
For the crazy answer, you could try formatting the drive such that you place your info on the fastest portion, and see if that helps any.
Tom's Hardware had a review on how that might be done.

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started by writing the results straight to files, one at a time, which was the slowest option. I figured it gets a lot faster if I build up a vector of a certain number of the files and then write them all at once, then go back to processing while the hard disk is occupied writing all the stuff I poured into it (that at least seems to be what happens).
My question is, can I somehow estimate a convergence value for the amount of data that I should write, based on the hardware constraints? To me it seems to be a hard-disk-buffer thing: I have a 16 MB buffer on that hard disk and get these values (all for ~100,000 files):
Buffer size     time (minutes)
-----------     --------------
no buffer       ~ 8:30
1 MB            ~ 6:15
10 MB           ~ 5:45
50 MB           ~ 7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how write performance should be optimized in general, for example whether larger hard-disk blocks are helpful, etc.
Edit:
The hardware is a pretty standard consumer drive (I'm a student, not a data center): a WD 3.5-inch 1 TB / 7200 RPM / 16 MB cache drive over USB 2, HFS+ journaled, on Mac OS X 10.5. I'll soon give it a try on ext3/Linux and on an internal disk rather than an external one.
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver the open, write, and close system calls to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
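A minimal sketch of that measure-don't-predict idea, assuming the results are small records written to individual files; the candidate batch sizes, the file count, and the payload are all illustrative:

    #include <chrono>
    #include <cstddef>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // Writes `files` small result files, buffering up to `batch_bytes` of results in
    // memory before writing the batch out, and returns the elapsed time in seconds.
    double time_batch_size(std::size_t batch_bytes, int files) {
        const std::string payload(4096, 'x');                     // 4 KB result per file
        const auto start = std::chrono::steady_clock::now();

        std::vector<std::pair<std::string, std::string>> batch;   // (file name, content)
        std::size_t buffered = 0;
        auto flush = [&] {
            for (const auto& f : batch) {
                std::ofstream out(f.first, std::ios::binary);
                out.write(f.second.data(), static_cast<std::streamsize>(f.second.size()));
            }
            batch.clear();
            buffered = 0;
        };

        for (int i = 0; i < files; ++i) {
            batch.emplace_back("out_" + std::to_string(i) + ".tmp", payload);
            buffered += payload.size();
            if (buffered >= batch_bytes) flush();
        }
        flush();                                                   // write the final partial batch
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    }

    int main() {
        const std::vector<std::size_t> candidates = {1u << 20, 10u << 20, 50u << 20};
        std::size_t best = candidates.front();
        double best_time = 1e300;
        for (std::size_t c : candidates) {
            const double t = time_batch_size(c, 20000);            // ~80 MB of results per run
            std::cout << (c >> 20) << " MB batches: " << t << " s\n";
            if (t < best_time) { best_time = t; best = c; }
        }
        std::cout << "fastest batch size: " << (best >> 20) << " MB\n";
    }

The winning size can then be saved to a configuration file and reused, as suggested above.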
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should also look at optimizing your read access. The OS (at least Windows) is already quite good at helping write access via buffering "under the hood", but if you're reading serially there isn't much it can do to help. If you use async I/O or (again) a task pool to read and process multiple parts of the file at once, you'll probably see increased performance.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. Writing 100,000s of files is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100,000 files. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
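A minimal sketch of that single-file-plus-index alternative: append every record to one data file and write (offset, length) pairs to a small index instead of creating one file per result; the file names and record contents are illustrative:

    #include <cstdint>
    #include <fstream>
    #include <string>

    int main() {
        std::ofstream data("results.dat", std::ios::binary);
        std::ofstream index("results.idx", std::ios::binary);

        std::uint64_t offset = 0;
        for (int i = 0; i < 100000; ++i) {
            const std::string record = "result for item " + std::to_string(i) + "\n";
            data.write(record.data(), static_cast<std::streamsize>(record.size()));

            const std::uint64_t len = record.size();   // (offset, length) lets readers seek directly
            index.write(reinterpret_cast<const char*>(&offset), sizeof offset);
            index.write(reinterpret_cast<const char*>(&len), sizeof len);
            offset += len;
        }
    }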
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and the write helper(s) consists of a two-std::vector double buffer per helper (one buffer owned by the write thread and one by the read thread). The read thread fills the buffer until a specified limit, then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit: if writing went faster than last time, increase the limit by X%; if it went slower, decrease it by X%. X can be small.
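A sketch of that adaptive double buffer, using std::chrono::steady_clock in place of gettimeofday; the dummy payload, the output file names, and the 10% adjustment step are illustrative choices, not a definitive implementation:

    #include <chrono>
    #include <condition_variable>
    #include <cstddef>
    #include <fstream>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <vector>

    struct Shared {
        std::vector<std::string> filling;   // filled by the reading/processing thread
        std::vector<std::string> writing;   // drained by the write helper
        std::size_t limit = 1000;           // results to gather before handing off
        bool handoff = false;               // true while a full buffer waits for the helper
        bool done = false;
        std::mutex m;
        std::condition_variable cv;
    };

    void write_helper(Shared& s) {
        double last_rate = 0.0;
        int batch = 0;
        for (;;) {
            {
                std::unique_lock<std::mutex> lock(s.m);
                s.cv.wait(lock, [&] { return s.handoff || s.done; });
                if (!s.handoff && s.done) return;
                s.writing.swap(s.filling);                  // take the full buffer, hand back the empty one
                s.handoff = false;
            }
            s.cv.notify_one();                              // the reader may start refilling immediately

            const auto t0 = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < s.writing.size(); ++i) {
                std::ofstream out("out_" + std::to_string(batch) + "_" + std::to_string(i) + ".tmp",
                                  std::ios::binary);
                out << s.writing[i];
            }
            ++batch;
            const double secs =
                std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
            const double rate = s.writing.size() / (secs > 0.0 ? secs : 1e-9);
            s.writing.clear();

            std::lock_guard<std::mutex> lock(s.m);
            if (rate >= last_rate) s.limit += s.limit / 10; // faster than last time: grow the limit
            else if (s.limit > 10) s.limit -= s.limit / 10; // slower: shrink it (with a floor)
            last_rate = rate;
        }
    }

    int main() {
        Shared s;
        std::thread helper(write_helper, std::ref(s));

        for (int i = 0; i < 20000; ++i) {                   // stand-in for reading/processing results
            std::unique_lock<std::mutex> lock(s.m);
            s.filling.push_back("result " + std::to_string(i));
            if (s.filling.size() >= s.limit) {
                s.handoff = true;
                s.cv.notify_one();
                s.cv.wait(lock, [&] { return !s.handoff; });  // block until the helper took the buffer
            }
        }
        {
            std::lock_guard<std::mutex> lock(s.m);
            if (!s.filling.empty()) s.handoff = true;       // hand off the final partial batch
            s.done = true;
        }
        s.cv.notify_one();
        helper.join();
    }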

Profiling disk access

Currently I am working on a MFC application which reads and writes in to the disk. Sometimes this application runs amazingly fast and sometimes it is damn slow. I am guessing that it is because of the disk access involved, hence I want to profile it. These are some questions in this regard:
(1). Currently I am using the AQTime profiler to profile the application. Has anybody tried profiling disk access using it? Or is there any other tool available which I can use?
(2). What are the most important disk parameters I should be looking at?
(3). If I have multiple threads trying to read and write data from disk, does it affect the performance? I.e., am I better off having single-threaded access to the disk?
You can use the Windows Performance Toolkit for this. You can enable trace providers for disk I/O events and see the I/O time and disk service time for each. It does have a bit of a learning curve, though. It will also let you determine which file I/Os actually result in real access to the disk rather than being handled by the cache manager.
Most important parameters are disk service time and queue length. Disk service time is how long the disk actually took to service the request. Queue length indicates if your disk request is backed up behind other requests.
As for many threads with reads and writes: many disks perform poorly in the face of reads mixed with background writes. If you have various threads doing lots of disk I/O to random locations on the disk, you may wind up starving certain requests.
To help you with (2):
Try to batch up your writes to disk to avoid many small calls to write. When you're done flushing your buffer, call commit. commit (a.k.a. fsync) is an expensive operation, and it becomes even more so when there are lots of small writes.
On Windows file handles you can experiment with FILE_FLAG_WRITE_THROUGH to increase write speeds. Supposedly commit doesn't have to be called on handles opened with this flag.
If the data you are writing to disk will also be accessed by reading, consider writing to an in-memory structure first and having another thread read from that structure and write it to disk. This will help avoid calls to read data from disk that you have just written.
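A hedged Win32 sketch of the batching idea: accumulate small records in a memory buffer, issue one WriteFile per batch, and call FlushFileBuffers as the "commit" step (FILE_FLAG_WRITE_THROUGH is shown as the commented-out alternative); the file name, record contents, and 1 MB batch size are illustrative:

    #include <windows.h>
    #include <string>

    int main() {
        HANDLE h = CreateFileA("results.log",
                               GENERIC_WRITE,
                               0,                      // no sharing while we write
                               nullptr,
                               CREATE_ALWAYS,
                               FILE_ATTRIBUTE_NORMAL,  // or FILE_FLAG_WRITE_THROUGH
                               nullptr);
        if (h == INVALID_HANDLE_VALUE) return 1;

        std::string batch;
        batch.reserve(1 << 20);                        // gather ~1 MB before writing

        for (int i = 0; i < 100000; ++i) {
            batch += "record " + std::to_string(i) + "\n";
            if (batch.size() >= (1 << 20)) {           // one large write instead of many small ones
                DWORD written = 0;
                WriteFile(h, batch.data(), static_cast<DWORD>(batch.size()), &written, nullptr);
                batch.clear();
            }
        }
        if (!batch.empty()) {                          // write the final partial batch
            DWORD written = 0;
            WriteFile(h, batch.data(), static_cast<DWORD>(batch.size()), &written, nullptr);
        }

        FlushFileBuffers(h);                           // the "commit" step; reportedly unnecessary with write-through
        CloseHandle(h);
        return 0;
    }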
Hopefully this helps....
What I would do, if you can't pause all threads at the same time and examine their state, is focus on one of them and pause it while it's being "damn slow". This is a little-known but effective technique.
Since it is being extremely slow compared to what it could be, whatever it is waiting for it is waiting for probably 99% of the time, so when you pause it you will see it. That's true whether it's one big wait, or a zillion little ones. Look at the whole call stack. The culprit may be somewhere in the middle of the stack.
If you're not sure, pause it two or three times. The culprit will be on all stack samples.