Reading a file in C++

I am writing an application to monitor a file and then match some pattern in that file.
I want to know what the fastest way to read a file in C++ is.
Is reading line by line faster, or is reading a chunk of the file faster?

Your question is more about the performance of hardware, operating systems and run-time libraries than about programming languages. When you start reading a file, the OS is probably loading it in chunks anyway: since the file is stored in chunks on disk, it makes sense for the OS to load each chunk entirely on first access and cache it, rather than read the chunk, extract the requested data and discard the rest.
Which is faster? Line by line or a chunk at a time? As always with these things, the answer is not something you can predict; the only way to know for sure is to write a line-by-line version and a chunk-at-a-time version and profile them (measure how long each version takes).

In general, reading large amounts of a file into a buffer, then parsing the buffer is a lot faster than reading individual lines. The actual proof is to profile code that reads line by line, then profile code reading in large buffers. Compare the profiles.
The foundation for this justification is:
Reduction of I/O Transactions
Keeping the Hard Drive Spinning
Parsing Memory Is Faster
I improved the performance of one application from 65 minutes down to 2 minutes by applying these techniques.
Reduction of I/O Transactions
Reducing the number of I/O transactions results in fewer calls to the operating system, which reduces the time spent there. It also reduces the number of branches in your code, improving the performance of the processor's instruction pipeline, and it reduces traffic to the hard drive: the drive has fewer commands to process, so it has less overhead.
Keeping the Hard Drive Spinning
To access a file, the hard drive has to ramp up its motors to a decent speed (which takes time), position the head over the desired track and sector, and read the data. Positioning the head and ramping up the motor are overhead required by every transaction; the overhead of actually reading the data is small. The objective is to read as much data as possible per transaction, because that is where the hard drive is most efficient. Reducing the number of transactions reduces the time spent waiting for motors to spin up and heads to be positioned.
Although modern computers have caches for both data and commands, reducing the number of requests will speed things up. Larger "payloads" make more efficient use of those caches and avoid the overhead of sorting the requests.
Parsing Memory Is Faster
Reading from memory is always faster than reading from an external source. Reading a second line of text from a buffer requires incrementing a pointer; reading a second line from a file requires an I/O transaction to get the data into memory. If your program has memory to spare, haul the data into memory, then search the memory.
Too Much Data Negates The Performance Savings
There is a finite amount of RAM on the computer for applications to share. Using more than that may cause the computer to "page", i.e. forward requests to the hard drive (also known as virtual memory). In that case there may be little savings, because the hard drive is accessed anyway (by the operating system, without your program's knowledge). Profiling will give you a good indication of the optimum buffer size.
The application I optimized was reading one byte at a time from a 2 GB file. The performance improved greatly when I changed the program to read 1 MB chunks of data. This also allowed for additional performance from loop unrolling.
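As an illustration, here is a minimal sketch of the chunked approach (the file name, the pattern and the 1 MB chunk size are hypothetical placeholders): read the file in large blocks with std::ifstream::read and scan each block in memory, carrying over a few bytes so a match that straddles a block boundary is not missed.

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("data.log", std::ios::binary);   // hypothetical file name
    if (!in) {
        std::cerr << "cannot open file\n";
        return 1;
    }

    const std::size_t chunkSize = 1 << 20;   // 1 MB per read, as described above
    std::vector<char> buffer(chunkSize);
    const std::string pattern = "ERROR";     // hypothetical pattern
    std::size_t hits = 0;
    std::string carry;                       // bytes that might straddle a chunk boundary

    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0)
            break;

        // Prepend the tail of the previous chunk so no match is lost at a boundary.
        std::string window = carry;
        window.append(buffer.data(), static_cast<std::size_t>(got));

        for (std::size_t pos = window.find(pattern); pos != std::string::npos;
             pos = window.find(pattern, pos + 1))
            ++hits;

        // Keep the last pattern.size()-1 bytes for the next iteration.
        carry = (window.size() >= pattern.size())
                    ? window.substr(window.size() - (pattern.size() - 1))
                    : window;
    }

    std::cout << "matches: " << hits << '\n';
}
```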
Hope this helps.

You could try to map the file directly into memory using a memory-mapped file, and then use standard C++ logic to find the patterns that you want.
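A minimal POSIX sketch of that idea (file name and pattern are hypothetical; on Windows you would use the file-mapping APIs instead):

```cpp
#include <cstddef>
#include <cstdio>
#include <iostream>
#include <string_view>

#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

int main() {
    const char* path = "data.log";            // hypothetical file name
    const std::string_view pattern = "ERROR"; // hypothetical pattern

    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st {};
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    // Map the whole file read-only; the OS pages it in on demand.
    void* addr = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                      PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    const std::string_view data(static_cast<const char*>(addr),
                                static_cast<std::size_t>(st.st_size));

    std::size_t hits = 0;
    for (std::size_t pos = data.find(pattern); pos != std::string_view::npos;
         pos = data.find(pattern, pos + 1))
        ++hits;

    std::cout << "matches: " << hits << '\n';

    munmap(addr, static_cast<std::size_t>(st.st_size));
    close(fd);
}
```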

The OS (or even the C++ class you use) probably reads the file in chunks and caches it, even if you read it line by line, to minimize disk access (from the operating system's point of view it is much faster to serve data from a memory buffer than from the hard disk device).
Note that a good way to improve the performance of your program (if it is really time critical) is to minimize the number of calls to operating system functions (which manage its resources).

Related

Why do file IO speeds change when reading file size?

I'm creating a memory analysis program in C++ on Windows 10 using a 7200rpm HDD that essentially scans your drive and reports back on which folders are using how much space, allowing you to figure out where most of your drive storage is being used.
For efficiency reasons, I'm using C++ and my methodology is scanning the entire drive recursively, then reading each file's size in another thread so that I can both scan and analyze size at the same time. For obvious reasons, scanning is much faster than reporting on size but I've noticed that the IO speeds jump around a lot. Sometimes it'll read the size of 5000 files/second, whereas other times it'll read 10 files/second. Take a look at the video at this link. The first number is how many files' sizes have been read and the second number is how many files have been found altogether. The first number is what's important here.
Why does my file IO speed change, and is there anything I can do about this?
You have many bottlenecks to consider, both on the processor side and on the hard drive side.
Locating The Data
Essentially, the hard drive has to locate the sectors and tracks that contain the data in the file. If you are really lucky, the data will be in sequential sectors on sequential tracks, causing very little head movement or repositioning. However, file data can be "scattered", in which case the hard drive reads as much as it can, calculates the next position of the data, relocates the head to that position and keeps reading. This affects the flow of the data. If your drive is intelligent and has a lot of cache, it could place this data into its cache and deliver it from the cache instead of the platters, possibly making up some of the time lost to repositioning.
The Data Bus
The data has to go into the PC's memory, and usually there is only one bus for it. This bus is shared among many entities in your system, the processor and the hard-drive controller to name a few. If you're lucky, your PC has a Direct Memory Access (DMA) controller for the hard drive, which can transfer data from the hard drive port into memory while bypassing the processor. However, the DMA controller must still share the data bus with the processor (and friends). Bus arbitration is another source of slowdown and inconsistency.
Sharing the Drive
Many operating systems use the hard drive as virtual memory; swapping out blocks of memory. These file requests will need to be intermingled with the requests from your program.
Sequential Access
Most of the cheaper platforms have sequential access to the drive: only one entity can read at a time, and most drives deliver a single bit stream. Higher-performance, custom platforms actually have more than one drive running in parallel. Because of the sequential nature of the device, entities must either wait for one another to finish or intermingle their transactions. Compare this to memory, which is accessed in parallel (8 or more bits read at the same time).
Interruptions & Scheduling
There are lots of activities going on inside your PC, from internet or Wi-Fi communications to audio and video playback (as well as other system tasks). They all need to run, and no matter how many cores you have, there aren't enough. Most operating systems run tasks by time and priority; very rarely will one task have exclusive ownership of a processor until it finishes. Your task will be intermingled with the other running tasks, which slows down your program.
Chunking It
Most disk clean-up utilities work in chunks or pieces of files. Speed is not as important as the quality of the data operation. For example, a smaller chunk of a file has a better chance of being moved or copied successfully than a huge chunk, and the program can be interrupted (by the user, for example). Smaller chunks allow for easier recovery from an interruption.
There are probably more reasons why your program is executing slowly or has inconsistent timings, but the above information should give you better insight as to the behavior of your PC.

Reading inputs in smaller, more frequent reads or one larger read

I was working on a C++ tutorial exercise that asked to count the number of words in a file. It got me thinking about the most efficient way to read the inputs. How much more efficient is it really to read the entire file at once than it is to read small chunks (line by line or character by character)?
The answer changes depending on how you're doing the I/O.
If you're using the POSIX open/read/close family, reading one byte at a time will be excruciating since each byte will cost one system call.
If you're using the C fopen/fread/fclose family or the C++ iostream library, reading one byte at a time still isn't great, but it's much better. These libraries keep an internal buffer and only call read when it runs dry. However, since you're doing something very trivial for each byte, the per-call overhead will still likely dwarf the per-byte processing you actually have to do. But measure it and see for yourself.
Another option is to simply mmap the entire file and just do your logic on that. You might, or might not, notice a performance difference between mmap with and without the MAP_POPULATE flag. Again, you'll have to measure it and see.
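For the word-counting exercise, a minimal Linux-flavoured sketch of the mmap approach might look like this (the input file name is hypothetical, and MAP_POPULATE is a Linux-specific flag, hence the #ifdef guard):

```cpp
#include <cctype>
#include <cstdio>
#include <iostream>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "input.txt";   // hypothetical input file

    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st {};
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    int flags = MAP_PRIVATE;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE;            // Linux-only: fault all pages in up front
#endif
    void* addr = mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ, flags, fd, 0);
    if (addr == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }
    const char* data = static_cast<const char*>(addr);

    // Count transitions from whitespace to non-whitespace.
    long words = 0;
    bool inWord = false;
    for (off_t i = 0; i < st.st_size; ++i) {
        const bool space = std::isspace(static_cast<unsigned char>(data[i])) != 0;
        if (!space && !inWord)
            ++words;
        inWord = !space;
    }
    std::cout << "words: " << words << '\n';

    munmap(addr, static_cast<size_t>(st.st_size));
    close(fd);
}
```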
The most efficient method for I/O is to keep the data flowing.
That said, reading one block of 512 characters is faster than 512 reads of 1 character. Your system may have made optimizations, such as caches, to make reading faster, but you still have the overhead of all those function calls.
There are different methods to keep the I/O flowing:
Memory mapped file I/O
Double buffering
Platform Specific API
Some simple experiments should suffice for demonstration (a code sketch of these steps follows the list below).
Create a vector or array of 1 megabyte.
Start a timer.
Repeat 1000 times:
Read data into container using 1 read instruction.
End the timer.
Repeat, using a for loop, reading 1,000,000 characters, with 1 read instruction each.
Compare your data.
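Here is a minimal sketch of that experiment, slightly adjusted so both variants read the same total amount of data per pass (the test file name is hypothetical and the file is assumed to be at least 1 MB):

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    const char* path = "test.bin";      // hypothetical test file, at least 1 MB in size
    const std::size_t oneMB = 1 << 20;
    const int repeats = 100;            // repeat each variant so the timing is measurable
    std::vector<char> buffer(oneMB);
    using clock = std::chrono::steady_clock;

    // Variant A: one 1 MB read per pass.
    std::ifstream a(path, std::ios::binary);
    if (!a) { std::cerr << "cannot open " << path << '\n'; return 1; }
    const auto t0 = clock::now();
    for (int r = 0; r < repeats; ++r) {
        a.clear();
        a.seekg(0);
        a.read(buffer.data(), static_cast<std::streamsize>(oneMB));
    }
    const auto t1 = clock::now();

    // Variant B: the same 1 MB, read one character at a time per pass.
    std::ifstream b(path, std::ios::binary);
    const auto t2 = clock::now();
    for (int r = 0; r < repeats; ++r) {
        b.clear();
        b.seekg(0);
        char c;
        for (std::size_t i = 0; i < oneMB; ++i)
            b.read(&c, 1);
    }
    const auto t3 = clock::now();

    const auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::cout << "one 1 MB read per pass : " << ms(t1 - t0) << " ms\n"
              << "1,048,576 byte reads   : " << ms(t3 - t2) << " ms\n";
}
```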
Details
For each request from the hard drive, the following steps are performed (depending on platform optimizations):
Start hard drive spinning.
Read filesystem directory.
Search directory for the filename.
Get logical position of the byte requested.
Seek to the given track & sector.
Read 1 or more sectors of data into hard drive memory.
Return the requested portion of hard drive memory to the platform.
Spin down the hard drive.
All of this is overhead (except for the step that actually reads the sectors).
The objective is to transfer as much data as possible while the hard drive is spinning; starting a hard drive takes more time than keeping it spinning.

Which are the best practices for data intensive reading and writing in a HD?

I'm developing a C++ application (running in a Linux box) that is very intensive in reading log files and writing derived results to disk. I'd like to know what the best practices are for optimizing this kind of application:
Which OS tweaks improve performance?
Which programming patterns boost IO throughput?
Is pre-processing the data (convert to binary, compress data, etc...) a helpful measure?
Does chunking/buffering data help performance?
Which hardware capabilities should I be aware of?
Which practices are best for profiling and measuring performance in these applications?
(express here the concern I'm missing)
Is there a good read where I could get the basics of this so I could adapt the existing know-how to my problem?
Thanks
Compression may certainly help a lot and is much simpler than tweaking the OS. Check out the gzip and bzip2 support in the Boost.IOStreams library. This takes its toll on the processor, though.
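A minimal sketch of reading a gzip-compressed log through Boost.IOStreams (the file name is hypothetical; link against boost_iostreams and zlib):

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/filtering_stream.hpp>

int main() {
    // Hypothetical compressed log file.
    std::ifstream file("results.log.gz", std::ios::binary);
    if (!file) { std::cerr << "cannot open input\n"; return 1; }

    boost::iostreams::filtering_istream in;
    in.push(boost::iostreams::gzip_decompressor());  // decompress on the fly
    in.push(file);                                   // underlying compressed source

    std::string line;
    std::size_t lines = 0;
    while (std::getline(in, line))
        ++lines;

    std::cout << "decompressed lines: " << lines << '\n';
}
```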
Measuring these kinds of jobs starts with the time command. If system time is very high compared to user time, then your program spends a lot of time doing system calls. If wall-clock ("real") time is high compared to system and user time, it's waiting for the disk or the network. The top command showing significantly less than 100% CPU usage for the program is also a sign of I/O bottleneck.
1) Check out your disk's sector size.
2) Make sure the disk is defragged.
3) Read data that is "local" to your last reads to improve cache locality (caching is performed by the operating system, and many hard drives also have a built-in cache).
4) Write data contiguously.
For write performance, cache blocks of data in memory until you reach a multiple of the sector size, then initiate an asynchronous write to disk. Do not overwrite the data currently being written until you are certain it has been written (i.e. sync the write). Double or triple buffering can help here.
For best read performance you can double-buffer reads. Say you cache 16K blocks on read: read the first 16K from disk into block 1 and initiate an asynchronous read of the second 16K into block 2. Start working on block 1. When you have finished with block 1, sync the read of block 2 and start an async read of the third 16K into block 1. Now work on block 2. When finished, sync the read of the third 16K block, initiate an async read of the fourth 16K into block 2, and work on block 1. Rinse and repeat until you have processed all the data.
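Here is a simplified sketch of that double-buffering scheme, using std::async in place of platform-specific asynchronous I/O (the input file name and the process() function are hypothetical placeholders):

```cpp
#include <cstddef>
#include <fstream>
#include <future>
#include <iostream>
#include <vector>

// Hypothetical processing step; replace with the real work done on each block.
static void process(const std::vector<char>& block, std::size_t n) {
    (void)block;
    (void)n;
}

int main() {
    std::ifstream in("data.bin", std::ios::binary);   // hypothetical input file
    const std::size_t blockSize = 16 * 1024;          // 16K blocks, as in the answer above
    std::vector<char> bufA(blockSize), bufB(blockSize);
    unsigned long long total = 0;

    // Prime block 1 (bufA) with a synchronous read.
    in.read(bufA.data(), static_cast<std::streamsize>(blockSize));
    std::streamsize gotA = in.gcount();

    while (gotA > 0) {
        // Kick off the read of the next block into bufB while we work on bufA.
        auto nextRead = std::async(std::launch::async, [&]() -> std::streamsize {
            in.read(bufB.data(), static_cast<std::streamsize>(blockSize));
            return in.gcount();
        });

        process(bufA, static_cast<std::size_t>(gotA));   // work on the current block
        total += static_cast<unsigned long long>(gotA);

        const std::streamsize gotB = nextRead.get();      // "sync" the pending read
        std::swap(bufA, bufB);                            // the new block becomes current
        gotA = gotB;
    }

    std::cout << "processed " << total << " bytes\n";
}
```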
As already stated, the less data you have to read, the less time is lost reading from disk, so it may well be worth reading compressed data and spending the CPU time expanding each block on read. Equally, compressing blocks before writing will save you disk time. Whether this is a win really depends on how CPU-intensive your processing of the data is.
Also if the processing on the blocks is asymmetric (ie processing block 1 can take 3 times as long as processing block 2) then consider triple or more buffering for reads.
Get information about the volume you'll be writing to/reading from and create buffers that match the characteristics of the volume. e.g. 10 * clusterSize.
Buffering helps a lot, as would minimizing the amount of writing you have to do.
As stated here, you should check the block size, which you can do with the stat family of functions.
In struct stat this information is in the st_blksize field.
The second thing is the posix_fadvise() function, which gives the OS advice about paging: you tell the system how you're going to use the file (or even a fragment of it). You'll find more on the manual page.
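A minimal sketch putting both together (the file name is hypothetical; posix_fadvise is only a hint, so a failure is not fatal):

```cpp
#include <cstdio>
#include <iostream>

#include <fcntl.h>      // open, posix_fadvise
#include <sys/stat.h>   // fstat, struct stat
#include <unistd.h>     // close

int main() {
    const char* path = "input.log";   // hypothetical file name

    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    // st_blksize is the filesystem's preferred I/O block size for this file.
    struct stat st {};
    if (fstat(fd, &st) == 0)
        std::cout << "preferred block size: " << st.st_blksize << " bytes\n";

    // Tell the kernel we intend to read the whole file sequentially so it can
    // read ahead more aggressively. The advice is only a hint; failure is not fatal.
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL) != 0)
        std::cerr << "posix_fadvise failed\n";

    // ... read the file here, ideally in st.st_blksize-sized chunks ...

    close(fd);
}
```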
On Windows, use CreateFile() with FILE_FLAG_SEQUENTIAL_SCAN and/or FILE_FLAG_NO_BUFFERING rather than fopen(); at least for writing, this returns immediately rather than waiting for the data to flush to disk.

Performance issues with hard disk reading

I have a C++ program which reads files from the hard disk and does some processing on the data in the files. I am using standard Win32 APIs to read the files. My problem is that this program is blazingly fast sometimes and then suddenly slows down to 1/6th of the previous speed. If I read the same files again and again over multiple runs, then normally the first run will be the slowest one. Then it maintains the speed until I read some other set of files. So my obvious guess was to profile the disk access time. I used the perfmon utility and measured the IO Read Bytes/sec for my program, and as expected there was a huge difference (~5 times) in the number of bytes read. My questions are:
(1). Does OS (Windows in my case) cache the recently read files somewhere so that the subsequent loads are faster?
(2). If I can guarantee that all the files I read reside in the same directory then is there any way I can place them in the hard disk so that my disk access time is faster?
Is there anything I can do about this?
1) Windows does cache recently read files in memory. The book Windows Internals includes an excellent description of how this works. Modern versions of Windows also use a technology called SuperFetch which will try to preemptively fetch disk contents into memory based on usage history and ReadyBoost which can cache to a flash drive, which allows faster random access. All of these will increase the speed with which data is accessed from disk after the initial run.
2) The directory a file is in really doesn't affect its layout on disk. Defragmenting your drive will group file data together. Windows Vista and up will automatically defragment your disk. Ideally, you want to do large sequential reads and minimize your writes. Small random accesses and interleaving writes with reads hurt performance significantly. You can use the Windows Performance Toolkit to profile your disk access.
Your numbered questions seem to be answered already. If you're still wondering what you can do to improve hard drive read speed, here are some tips:
Read with the OS functions (e.g., ReadFile) rather than wrapper libraries (like iostreams or stdio) if possible. Many wrappers introduce more levels of buffering.
Read sequentially, and let Windows know you're going to read sequentially with the FILE_FLAG_SEQUENTIAL_SCAN flag.
If you're only going to read (and not write), be sure to open the file just for reading.
Read in chunks, not bytes or characters (see the sketch after this list).
Ideally the chunks should be multiples of the disk's cluster size.
Read from the disc at cluster-aligned offsets.
Read to memory at page-boundaries. (If you're allocating a big chunk, it's probably page aligned.)
Advanced: If you can start your computation after reading just the beginning of the file, then you can use overlapped I/O to try to parallelize the computation and the subsequent reads as much as possible.
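As a rough sketch of the first few points above (hypothetical file name; sequential, chunked reads through ReadFile with the FILE_FLAG_SEQUENTIAL_SCAN hint):

```cpp
#include <windows.h>

#include <iostream>
#include <vector>

int main() {
    // Hypothetical file name; ideally the chunk size is a multiple of the cluster size.
    HANDLE h = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    if (h == INVALID_HANDLE_VALUE) {
        std::cerr << "CreateFile failed: " << GetLastError() << '\n';
        return 1;
    }

    std::vector<char> chunk(1 << 20);   // read in 1 MB chunks
    unsigned long long total = 0;
    DWORD got = 0;

    // Read sequentially, chunk by chunk, until ReadFile reports 0 bytes (end of file).
    while (ReadFile(h, chunk.data(), static_cast<DWORD>(chunk.size()), &got, nullptr) &&
           got > 0) {
        // ... process chunk[0 .. got) here ...
        total += got;
    }

    std::cout << "read " << total << " bytes\n";
    CloseHandle(h);
}
```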
Yes, Windows (and most modern OS's) keep recently read file data in otherwise unused RAM so that if that file data is requested again in the near future it will already be available in RAM and disk access can be avoided.
As far as making disk access faster, you could try defragmenting your drive, but I wouldn't expect it to help too much. Drive access is just slow compared to RAM access, which is why RAM caching provides such a nice speedup.
As a diagnostic test, can you accurately measure the time it takes to load the very first time?
Then take that to determine the transfer rate. Then you can take that transfer rate and compare that to what you get when running HD Tune. For what it's worth, I ran this myself and got 44.2 MB/s minimum, 87 MB/s average, 110 MB/s max read speeds with my Western Digital RE3 drive (one of the faster 7200 RPM SATA drives available).
The point of all this is to see if your own application is doing the best it can. In other words, aside from caching you can't really read the files any faster than what your hard drive is capable of. So if you're reaching that limit then there's nothing more to do.
Also, make sure that you are not running out of memory during your tests. Run perfmon and monitor Memory > Available Bytes and PhysicalDisk > Disk Read Bytes/sec for the physical drive you are reading. Monitoring process' I/O is a good idea too. Keep in mind that the latter combines all I/O (network included).
You should expect about 50 MB/s for sequential reads from a single average SATA drive. A couple of good striped serial SCSI drives will give you about 220 MB/s. If you are seeing available memory going to near zero, that would be your problem. If it stays flat after you did the first round of reading, then it has something to do with your app.
A Microsoft utility called contig can be used to defragment a single file on disk or to create a new unfragmented file.
For the crazy answer, you could try formatting the drive such that you place your info on the fastest portion, and see if that helps any.
Tom's Hardware had a review on how that might be done.

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started writing the results straight away, one file at a time, which was the slowest option. I figured it gets a lot faster if I build up a vector of a certain number of results and then write them all at once, then go back to processing while the hard disk is occupied writing all the stuff I poured into it (at least that seems to be what happens).
My question is: can I somehow estimate a convergence value for the amount of data that I should write, from the hardware constraints? To me it seems to be a hard disk buffer thing; I have a 16 MB buffer on that hard disk and get these values (all for ~100,000 files):
Buffer size     Time (minutes)
-----------     --------------
no buffer       ~ 8:30
1 MB            ~ 6:15
10 MB           ~ 5:45
50 MB           ~ 7:00
Or is this just a coincidence?
I would also be interested in experience/rules of thumb about how write performance can be optimized in general, for example whether larger hard disk blocks are helpful, etc.
Edit:
The hardware is a pretty standard consumer drive (I'm a student, not a data center): a 3.5" WD, 1 TB/7200 rpm/16 MB/USB 2, HFS+ journaled, and the OS is Mac OS X 10.5. I'll soon give it a try on ext3/Linux and an internal disk rather than an external one.
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
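As a small, hypothetical sketch of such a configuration run: time writing a fixed amount of data with several candidate chunk sizes and keep the fastest (the scratch file name, sizes and the 64 MB total are placeholder choices; without an explicit sync the OS cache absorbs much of the work, so treat the numbers as relative):

```cpp
#include <chrono>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    // Write 64 MB to a scratch file with several chunk sizes and report the fastest.
    // The winning size could then be saved to a config file and reused.
    const std::size_t totalBytes = 64u << 20;
    const std::size_t candidates[] = {4u << 10, 64u << 10, 1u << 20, 8u << 20};

    std::size_t bestSize = 0;
    double bestMs = 1e300;

    for (const std::size_t chunk : candidates) {
        std::vector<char> buf(chunk, 'x');
        const auto t0 = std::chrono::steady_clock::now();
        {
            std::ofstream out("bench.tmp", std::ios::binary | std::ios::trunc);
            for (std::size_t written = 0; written < totalBytes; written += chunk)
                out.write(buf.data(), static_cast<std::streamsize>(chunk));
            out.flush();   // flushes to the OS cache, not necessarily to the platters
        }
        const auto t1 = std::chrono::steady_clock::now();
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::cout << chunk << " bytes/write: " << ms << " ms\n";
        if (ms < bestMs) { bestMs = ms; bestSize = chunk; }
    }

    std::remove("bench.tmp");
    std::cout << "fastest chunk size: " << bestSize << " bytes\n";
}
```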
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should also look at optimizing your read access. The OS (at least Windows) is already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't much it can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. Writing hundreds of thousands of files is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100,000 files, and compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double-buffer per helper. (One buffer owned by the write process and one by the read process.) The read thread fills the buffer until a specified limit then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit. If writing went faster than last time, increase the buffer by X%. If it went slower, adjust by –X%. X can be small.
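Below is a simplified, hypothetical sketch of that scheme using std::thread, a mutex/condition variable and std::chrono in place of gettimeofday; the output file, the item format and the ±10% adjustment step are all placeholder choices, and the real program would write one small file per item rather than appending to a single file.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Shared state for the two-buffer handoff: the read/processing thread fills
// readBuf up to `limit` items, the writer helper swaps it out and flushes it.
std::mutex m;
std::condition_variable cv;
std::vector<std::string> readBuf, writeBuf;
std::size_t limit = 1000;   // items per handoff; adjusted by the writer
bool done = false;

void writerHelper() {
    double lastMs = -1.0;
    for (;;) {
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return readBuf.size() >= limit || done; });
            if (readBuf.empty() && done)
                return;
            writeBuf.swap(readBuf);   // take the full buffer, hand back an empty one
            cv.notify_all();
        }

        const auto t0 = std::chrono::steady_clock::now();
        // Hypothetical output: the real program writes one small file per item.
        std::ofstream out("results.txt", std::ios::app);
        for (const auto& s : writeBuf)
            out << s << '\n';
        out.flush();
        writeBuf.clear();
        const double ms = std::chrono::duration<double, std::milli>(
                              std::chrono::steady_clock::now() - t0).count();

        {
            std::lock_guard<std::mutex> lock(m);
            if (lastMs >= 0.0)
                limit = (ms < lastMs) ? limit + limit / 10    // faster than last time: +10%
                                      : limit - limit / 10;   // slower: -10%
            if (limit < 16)
                limit = 16;
            lastMs = ms;
        }
        cv.notify_all();   // the reader's "batch full" condition depends on limit
    }
}

int main() {
    std::thread writer(writerHelper);

    // The read/processing side: produce results and hand them off in batches.
    for (int i = 0; i < 100000; ++i) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return readBuf.size() < limit; });   // block while the batch is full
        readBuf.push_back("result " + std::to_string(i));       // hypothetical result item
        if (readBuf.size() >= limit)
            cv.notify_all();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_all();
    writer.join();
}
```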