What does "Disk Profiling" mean (related to hard disks)? [duplicate] - profiling

Currently I am working on a MFC application which reads and writes in to the disk. Sometimes this application runs amazingly fast and sometimes it is damn slow. I am guessing that it is because of the disk access involved, hence I want to profile it. These are some questions in this regard:
(1).Currently I am using AQTime profiler to profile the application. Has anybody tried profiling disk access using this? or is there any other tool available which I can use?
(2). What are the most important disk parameters I should be looking at?
(3). If I have multiple threads trying to read and write the data from disk does it affect the performance? i.e. am I better off having a single threaded access to the disk?

You can use the Windows Performance Toolkit for this. You can enable trace providers for disk I/O events and see the I/O time and disk service time for each. It does have a bit of a learning curve though. This will also let you determine which file I/O's actually result in real-access to the disk and aren't handled by the cache manager.
Most important parameters are disk service time and queue length. Disk service time is how long the disk actually took to service the request. Queue length indicates if your disk request is backed up behind other requests.
For many threads w/ reads & writes - Many disks have poor performance in the face of reads with background writes. If you have various threads doing lots of disk I/O to random locations on the disk, you may wind up starving certain requests.

To help you with (2):
Try to batch up your writes to disk to avoid many small calls to write. When you're done flushing your buffer, call commit. commit (aka fsync) is an expensive operation, so becomes even more so when there are lots of small writes.
On windows file handles you can experiment with FILE FLAG WRITE THROUGH to increase write speeds. Supposedly commit doesn't have to be called with handles using this flag.
If data you are writing to disk will also be accessed through reading, consider writing to an in memory structure first, having another thread read from the structure to write it to disk. This will help avoid calls to read data from disk that you have just written.
Hopefully this helps....

What I would do is, if you can't pause all threads at the same time and examine their state, focus on one of them and pause that, while it's being "damn slow". This is a little known but effective technique.
Since it is being extremely slow compared to what it could be, whatever it is waiting for it is waiting for probably 99% of the time, so when you pause it you will see it. That's true whether it's one big wait, or a zillion little ones. Look at the whole call stack. The culprit may be somewhere in the middle of the stack.
If you're not sure, pause it two or three times. The culprit will be on all stack samples.

Related

Why do file IO speeds change when reading file size?

I'm creating a memory analysis program in C++ on Windows 10 using a 7200rpm HDD that essentially scans your drive and reports back on which folders are using how much space, allowing you to figure out where most of your drive storage is being used.
For efficiency reasons, I'm using C++ and my methodology is scanning the entire drive recursively, then reading each file's size in another thread so that I can both scan and analyze size at the same time. For obvious reasons, scanning is much faster than reporting on size but I've noticed that the IO speeds jump around a lot. Sometimes it'll read the size of 5000 files/second, whereas other times it'll read 10 files/second. Take a look at the video at this link. The first number is how many files' sizes have been read and the second number is how many files have been found altogether. The first number is what's important here.
Why does my file IO speed change, and is there anything I can do about this?
You have many bottlenecks to consider, both on the processor side and on the hard drive side.
Locating The Data
Essentially, the hard drive has to locate the sectors and tracks that contain the data in the file. If you are really lucky, the data will be in sequential sectors on sequential tracks, thus causing very little head movement or repositioning. However, file data can be "scattered", and thus the hard drive will read as much as it can, then calculate the next position of the data, relocate the head to that position and keep reading. This affects the flow of the data. If you drive is intelligent and has a lot of cache, the drive could place this data into a cache and deliver data from the cache instead of the drive, possibly making up some lost nanoseconds due to repositioning.
The Data Bus
The data has to go into the PC's memory. Usually there is only one bus for the data. This bus is shared among many entities in your system, the processor and the hard-drive controller to name a few. If your lucky, your PC has a Direct Memory Access (DMA) controller for the hard drive. The controller can transfer data from the hard drive port into memory, bypassing the processor. However, the DMA controller must share the data bus with the processor (and friends). The bus arbitration is another slowdown and inconsistency.
Sharing the Drive
Many operating systems use the hard drive as virtual memory; swapping out blocks of memory. These file requests will need to be intermingled with the requests from your program.
Sequential Access
Most of the cheaper platforms have sequential access to the drive. Only one entity can read at the same time. Most drives are a single bit stream. Higher performance, custom platforms, actually have more than one drive running in parallel. Because of the sequential nature of the device, entities must either wait for another to finish or intermingle the transactions. Compared to memory that is parallel access (8 or more bits read at the same time).
Interruptions & Scheduling
There are lots of activities going on inside your PC, from internet or wifi communications to audio and video playbacks (as well as other system tasks running). These all need to run. No matter how many cores you have, there isn't enough. Most Operating Systems will run the tasks by time and priority. Very rarely will one task have exclusive ownership of a processor until the task finishes. Your task will be intermingled with other tasks that are running. Thus slowing down your program.
Chunking It
Most disk clean up utilities work in chunks or pieces of files. Speed is not as important as the quality of the data operation. For example, a smaller chunk of a file will have better success at being moved or copied than a huge chunk. The program can be interrupted (from a User, for example). Smaller chunks allow for easier recovery from an interruption.
There are probably more reasons why your program is executing slowly or has inconsistent timings, but the above information should give you better insight as to the behavior of your PC.

performance of fread/fwrite while doing operation with binary file

I am writing some binary data into a binary file through fwrite and once i am through with writing i am reading back the same data thorugh fread.While doing this i found that fwrite is taking less time to write whole data where as fread is taking more time to read all data.
So, i just want to know is it fwrite always takes less time than fread or there is some issue with my reading portion.
Although, as others have said, there are no guarantees, you'll typically find that a single write will be faster than a single read. The write will be likely to copy the data into a buffer and return straight away, while the read will be likely to wait for the data to be fetched from the storage device. Sometimes the write will be slow if the buffers fill up; sometimes the read will be fast if the data has already been fetched. And sometimes one of the many layers of abstraction between fread/fwrite and the storage hardware will decide to go off into its own little world for no apparent reason.
The C++ language makes no guarantees on the comparative performance of these (or any other) functions. It is all down to the combination of hardware and operating system, the load on the machine and the phase of the moon.
These functions interact with the operating system's file system cache. In many cases it is a simple memory-to-memory copy. Write could indeed be marginally faster if you run your program repeatedly. It just needs to find a hole in the cache to dump its data. Flushing that data to the disk happens at a time you can't see or measure.
More work is usually needed to read. At a minimum it needs to traverse the cache structure to discover if the disk data is already cached. If not, it is going to have to block on a disk driver request to retrieve the data from the disk, that takes many milliseconds.
The standard trap with profiling this behavior is taking measurements from repeated runs of your program. They are not at all representative for the way your program is going to behave in the wild. The odds that the disk data is already cached are very good on the second run of your program. They are very poor in real life, reads are likely to be very slow, especially the first one. An extra special trap exists for a write, at some point (depending on the behavior of other programs too), the cache is not going to be able to buffer the write request. Write performance is then going to fall of a cliff as your program gets blocked until enough data is flushed to the disk.
Long story short: don't ever assume disk read/write performance measurements are representative for how your program will behave in production. And perhaps more to the point: there isn't anything you can do to solve disk I/O perf problems in your code.
You are seeing some effect of the buffer/cache systems as other have said, however, if you use async API (as you said your suing fread/write you should look at aio_read/aio_write) you can experiment with some other methods for I/O which are likely more well optimized for what your doing.
One suggestion is that if you are read/update/write/reading a file a lot, you should, by way of an ioctl or DeviceIOControl, request to the OS to provide you the geometry of the disk your code is running on, then determine the size of a disk cylander so you may be able to determine if you can do your read/write operations buffered inside of a single cylinder. This way, the drive head will not move for your read/write and save you a fair amount of run time.

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started writing the results straight away in files, one at a time, which was the slowest option. I figured it gets a lot faster if I build a vector of a certain amount of the files and then write them all at once, then go back to processing while the hard disk is occupied in writing all that stuff that i poured into it (that at least seems to be what happens).
My question is, can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints ? To me it seems to be a hard disk buffer thing, I have 16MB buffer on that hard disk and get these values (all for ~100000 files):
Buffer size time (minutes)
------------------------------
no Buffer ~ 8:30
1 MB ~ 6:15
10 MB ~ 5:45
50 MB ~ 7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how writing performance is to be optimized in general, for example are larger hard disk blocks helpful, etc.
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center) WD 3,5 1TB/7200/16MB/USB2, HFS+ journalled, OS is MacOS 10.5. I'll soon give it a try on Ext3/Linux and internal disk rather than external).
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. OS's (at least windows) is already really good at helping write access via buffering "under the hood", but if your reading in serial there isn't too much it can do to help. If use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased perf.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. 100.000s of files to write is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100.000. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double-buffer per helper. (One buffer owned by the write process and one by the read process.) The read thread fills the buffer until a specified limit then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit. If writing went faster than last time, increase the buffer by X%. If it went slower, adjust by –X%. X can be small.

How best to manage Linux's buffering behavior when writing a high-bandwidth data stream?

My problem is this: I have a C/C++ app that runs under Linux, and this app receives a constant-rate high-bandwith (~27MB/sec) stream of data that it needs to stream to a file (or files). The computer it runs on is a quad-core 2GHz Xeon running Linux. The filesystem is ext4, and the disk is a solid state E-SATA drive which should be plenty fast for this purpose.
The problem is Linux's too-clever buffering behavior. Specifically, instead of writing the data to disk immediately, or soon after I call write(), Linux will store the "written" data in RAM, and then at some later time (I suspect when the 2GB of RAM starts to get full) it will suddenly try to write out several hundred megabytes of cached data to the disk, all at once. The problem is that this cache-flush is large, and holds off the data-acquisition code for a significant period of time, causing some of the current incoming data to be lost.
My question is: is there any reasonable way to "tune" Linux's caching behavior, so that either it doesn't cache the outgoing data at all, or if it must cache, it caches only a smaller amount at a time, thus smoothing out the bandwidth usage of the drive and improving the performance of the code?
I'm aware of O_DIRECT, and will use that I have to, but it does place some behavioral restrictions on the program (e.g. buffers must be aligned and a multiple of the disk sector size, etc) that I'd rather avoid if I can.
You can use the posix_fadvise() with the POSIX_FADV_DONTNEED advice (possibly combined with calls to fdatasync()) to make the system flush the data and evict it from the cache.
See this article for a practical example.
If you have latency requirements that the OS cache can't meet on its own (the default IO scheduler is usually optimized for bandwidth, not latency), you are probably going to have to manage your own memory buffering. Are you writing out the incoming data immediately? If you are, I'd suggest dropping that architecture and going with something like a ring buffer, where one thread (or multiplexed I/O handler) is writing from one side of the buffer while the reads are being copied into the other side.
At some size, this will be large enough to handle the latency required by a pessimal OS cache flush. Or not, in which case you're actually bandwidth limited and no amount of software tuning will help you until you get faster storage.
You can adjust the page cache settings in /proc/sys/vm, (see /proc/sys/vm/dirty_ratio, /proc/sys/vm/swappiness specifically) to tune the page cache to your liking.
If we are talking about std::fstream (or any C++ stream object)
You can specify your own buffer using:
streambuf* ios::rdbuf ( streambuf* streambuffer);
By defining your own buffer you can customize the behavior of the stream.
Alternatively you can always flush the buffer manually at pre-set intervals.
Note: there is a reson for having a buffer. It is quicker than writting to a disk directly (every 10 bytes). There is very little reason to write to a disk in chunks smaller than the disk block size. If you write too frquently the disk controler will become your bottle neck.
But I have an issue with you using the same thread in the write proccess needing to block the read processes.
While the data is being written there is no reason why another thread can not continue to read data from your stream (you may need to some fancy footwork to make sure they are reading/writting to different areas of the buffer). But I don't see any real potential issue with this as the IO system will go off and do its work asyncroniously (potentially stalling your write thread (depending on your use of the IO system) but not nesacerily your application).
I know this question is old, but we know a few things now we didn't know when this question was first asked.
Part of the problem is that the default values for /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio are not appropriate for newer machines with lots of memory. Linux begins the flush when dirty_background_ratio is reached, and blocks all I/O when dirty_ratio is reached. Lower dirty_background_ratio to start flushing sooner, and raise dirty_ratio to start blocking I/O later. On very large memory systems, (32GB or more) you may even want to use dirty_bytes and dirty_background_bytes, since the minimum increment of 1% for the _ratio settings is too coarse. Read https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ for a more detailed explanation.
Also, if you know you won't need to read the data again, call posix_fadvise with FADV_DONTNEED to ensure cache pages can be reused sooner. This has to be done after linux has flushed the page to disk, otherwise the flush will move the page back to the active list (effectively negating the effect of fadvise).
To ensure you can still read incoming data in the cases where Linux does block on the call to write(), do file writing in a different thread than the one where you are reading.
Well, try this ten pound hammer solution that might prove useful to see if i/o system caching contributes to the problem: every 100 MB or so, call sync().
You could use a multithreaded approach—have one thread simply read data packets and added them to a fifo, and the other thread remove packets from the fifo and write them to disk. This way, even if the write to disk stalls, the program can continue to read incoming data and buffer it in RAM.

Profiling disk access

Currently I am working on a MFC application which reads and writes in to the disk. Sometimes this application runs amazingly fast and sometimes it is damn slow. I am guessing that it is because of the disk access involved, hence I want to profile it. These are some questions in this regard:
(1).Currently I am using AQTime profiler to profile the application. Has anybody tried profiling disk access using this? or is there any other tool available which I can use?
(2). What are the most important disk parameters I should be looking at?
(3). If I have multiple threads trying to read and write the data from disk does it affect the performance? i.e. am I better off having a single threaded access to the disk?
You can use the Windows Performance Toolkit for this. You can enable trace providers for disk I/O events and see the I/O time and disk service time for each. It does have a bit of a learning curve though. This will also let you determine which file I/O's actually result in real-access to the disk and aren't handled by the cache manager.
Most important parameters are disk service time and queue length. Disk service time is how long the disk actually took to service the request. Queue length indicates if your disk request is backed up behind other requests.
For many threads w/ reads & writes - Many disks have poor performance in the face of reads with background writes. If you have various threads doing lots of disk I/O to random locations on the disk, you may wind up starving certain requests.
To help you with (2):
Try to batch up your writes to disk to avoid many small calls to write. When you're done flushing your buffer, call commit. commit (aka fsync) is an expensive operation, so becomes even more so when there are lots of small writes.
On windows file handles you can experiment with FILE FLAG WRITE THROUGH to increase write speeds. Supposedly commit doesn't have to be called with handles using this flag.
If data you are writing to disk will also be accessed through reading, consider writing to an in memory structure first, having another thread read from the structure to write it to disk. This will help avoid calls to read data from disk that you have just written.
Hopefully this helps....
What I would do is, if you can't pause all threads at the same time and examine their state, focus on one of them and pause that, while it's being "damn slow". This is a little known but effective technique.
Since it is being extremely slow compared to what it could be, whatever it is waiting for it is waiting for probably 99% of the time, so when you pause it you will see it. That's true whether it's one big wait, or a zillion little ones. Look at the whole call stack. The culprit may be somewhere in the middle of the stack.
If you're not sure, pause it two or three times. The culprit will be on all stack samples.