Possible duplicate: How do you determine the ideal buffer size when using FileInputStream?
When reading raw data from a file (or any input stream) using either C++'s istream family's read() or C's fread(), a buffer has to be supplied along with a count of how much data to read. Most programs I have seen seem to arbitrarily choose a power of 2 between 512 and 4096.
Is there a reason it has to/should be a power of 2, or is this just programmers' natural inclination towards powers of 2?
What would be the "ideal" number? By "ideal" I mean that it would be the fastest. I assume it would have to be a multiple of the underlying device's buffer size? Or maybe of the underlying stream object's buffer? How would I determine what the size of those buffers is, anyway? And once I do, would using a multiple of it give any speed increase over just using the exact size?
EDIT
Most answers seem to be that it can't be determined at compile time. I am fine with finding it at runtime.
SOURCE:
How do you determine the ideal buffer size when using FileInputStream?
Optimum buffer size is related to a number of things: file system block size, CPU cache size and cache latency.
Most file systems are configured to use block sizes of 4096 or 8192. In theory, if you configure your buffer size so you are reading a few bytes more than the disk block, the operations with the file system can be extremely inefficient (i.e. if you configured your buffer to read 4100 bytes at a time, each read would require 2 block reads by the file system). If the blocks are already in cache, then you wind up paying the price of RAM -> L3/L2 cache latency. If you are unlucky and the blocks are not in cache yet, then you pay the price of the disk -> RAM latency as well.
This is why you see most buffers sized as a power of 2, and generally larger than (or equal to) the disk block size. This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads.
Sizing the buffer this way also tends to satisfy other performance-friendly alignment constraints affecting both reading and subsequent processing: data bus width, DMA alignment, memory cache line alignment, and a whole number of virtual memory pages. A minimal sketch of this advice in practice follows.
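As a small illustration of the above (assuming a hypothetical data.bin input file and that 4096 matches the filesystem block size), a read loop whose buffer is a power-of-two multiple of the block size might look like this:

#include <cstdio>
#include <vector>

int main() {
    // Assumption: 4096 matches the filesystem block size; a power-of-two
    // multiple of it keeps every fread aligned to whole blocks.
    const std::size_t kBufSize = 4 * 4096;  // 16 KiB
    std::vector<unsigned char> buffer(kBufSize);

    FILE* f = std::fopen("data.bin", "rb");  // hypothetical input file
    if (!f) return 1;

    std::size_t n, total = 0;
    while ((n = std::fread(buffer.data(), 1, buffer.size(), f)) > 0)
        total += n;  // process buffer[0..n) here
    std::fclose(f);
    std::printf("read %zu bytes\n", total);
    return 0;
}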
At least in my case, the assumption is that the underlying system is using a buffer whose size is a power of two, too, so it's best to try to match. I think nowadays buffers should be made a bit bigger than what "most" programmers tend to make them. I'd go with 32 KB rather than 4 KB, for instance.
It's very hard to know in advance, unfortunately. It depends on whether your application is I/O or CPU bound, for instance.
I think that mostly it's just choosing a "round" number. If computers worked in decimal we'd probably choose 1000 or 10000 instead of 1024 or 8192. There is no very good reason.
One possible reason is that disk sectors are usually 512 bytes in size, so reading a multiple of that is more efficient - assuming that all the hardware layers and caching allow the low-level code to actually exploit this fact, which it probably can't unless you are writing a device driver or doing unbuffered reads.
No reason that I know of that it has to be a power of two. You are constrained by the buffer size having to be within max size_t but this is unlikely to be an issue.
Clearly, the bigger the buffer the better, but this is obviously not scalable, so some account must be taken of system resource considerations, either at compile time or, preferably, at runtime.
1. Is there a reason it has to/should be a power of 2, or is this just programmers' natural inclination towards powers of 2?
Not really. It should probably be a multiple of the data bus width to simplify memory copies, so any multiple of 16 bytes would be good with current technology. Using a power of 2 makes it likely that it will work well with any future technology.
2. What would be the "ideal" number? By "ideal" I mean that it would be the fastest.
The fastest would be as much as possible. However, once you go over a few kilobytes you will have a very small performance difference compared to the amount of memory that you use.
I assume it would have to be a multiple of the underlying device's buffer size? Or maybe of the underlying stream object's buffer? How would I determine what the size of those buffers is, anyway?
You can't really know the size of the underlying buffers, or depend on them remaining the same.
And once I do, would using a multiple of it give any speed increase over just using the exact size?
Some, but very little.
I think the ideal buffer size is the size of one block on your hard drive, so that it maps cleanly onto whole blocks when storing or fetching data from the drive.
Related
I'm writing a C++ program for AES encryption in CTR mode, but my question doesn't require knowledge of either.
I'm wondering how much of a file I should buffer at a time to encrypt and output to the new encrypted file. I ask this because I know disk reads are quite expensive, so it only makes sense that I should, if possible, read and buffer the entire original file, encrypt it, and output it to the new file. However, if the file is 1 GB, I don't want to reserve a whole 1 GB in main memory for the duration of the encryption.
So, I'm curious what the optimal buffer size is - for example, buffering 100 MB and performing 10 iterations of encryption to process the entire 1 GB file. Thanks.
Memory map the file and let the system figure out the right buffer size.
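For what it's worth, a minimal POSIX sketch of that approach (the file name is hypothetical; on Windows the equivalent calls are CreateFileMapping/MapViewOfFile):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("plain.bin", O_RDONLY);   // hypothetical input file
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);
    // Map the whole file read-only; the OS pages it in as it is touched.
    unsigned char* data = static_cast<unsigned char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) return 1;
    // ... walk data[0 .. st.st_size) in AES-block-sized chunks ...
    munmap(data, st.st_size);
    close(fd);
    return 0;
}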
Usually the file is buffered into main memory anyway (on server and desktop systems). So the buffer size in your application can be kept relatively small. 1 MiB would be plenty and would probably not matter much on any system with 1 GiB of main memory or more.
On embedded systems that do not buffer files in memory, it may require some figuring out of what is happening underneath and how much memory needs to be taken. I would consider a buffer of about 1-8 KiB a good minimum requirement; if you go lower than that, you might want to time the AES operations as well.
To make sure you can optimize later on, you may want to make the buffer a multiple of 64 bytes (the block size of AES is 16 bytes and that of SHA-512 is 64 bytes). In general, try to keep to full powers of two or as close to that as possible (one MiB is 2^20 bytes). A sketch of the resulting loop follows.
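A minimal sketch of the chunked loop this implies (the file names, the 1 MiB chunk size, and the encrypt_chunk placeholder are all assumptions, not a specific library's API):

#include <cstddef>
#include <cstdio>
#include <vector>

// Placeholder: a real implementation would apply AES-CTR here.
void encrypt_chunk(unsigned char* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i] ^= 0xAA;
}

int main() {
    const std::size_t kChunk = 1 << 20;         // 1 MiB: a multiple of 64 bytes
    std::vector<unsigned char> buf(kChunk);
    FILE* in  = std::fopen("plain.bin",  "rb"); // hypothetical file names
    FILE* out = std::fopen("cipher.bin", "wb");
    if (!in || !out) return 1;
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), in)) > 0) {
        encrypt_chunk(buf.data(), n);           // CTR mode needs no padding
        std::fwrite(buf.data(), 1, n, out);
    }
    std::fclose(in);
    std::fclose(out);
    return 0;
}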
Who's telling you that "disk reads are quite expensive"? Unless you're processing terabytes of data the cost of IO is going to be so inconsequential you'll have a hard time measuring it. A 1MB buffer will be way more than what you need. I bet you'd have a hard time finding a benchmarkable difference between 64KB and 1MB or more.
The one exception to this is if you're reading a lot of data off of a really slow device, something like a NAS drive on a congested network, but even then I'd consider any effort to implement buffering to be a false optimization. In that case copy the data to a local drive, process it off of local storage.
C++ buffers input and output with reasonable defaults anyway, plus most operating systems will fetch blocks of data as you're reading sequentially in order to make retrieval efficient. Unless you have a very compelling reason, stick with the normal behaviour. There should be no need to write custom buffering code.
I need to read byte arrays from several locations of a big file.
I have already optimized the file so that as few sections as possible have to be read, and the sections are as closely together as possible.
I have 20 calls like this one:
m_content.resize(iByteCount);             // make room for the section
fseek(iReadFile, iStartPos, SEEK_SET);    // jump to the section's offset
size_t readElements = fread(&m_content[0], sizeof(unsigned char), iByteCount, iReadFile);
iByteCount is around 5000 on average.
Before using fread, I used a memory-mapped file, but the results were approximately the same.
My calls are still too slow (around 200 ms) when called for the first time. When I repeat the same call with the same sections of bytes to read, it is very fast (around 1 ms), but that does not really help me.
The file is big (around 200 MB).
After this call, I have to read double values from a different section of the file, but I can not avoid this.
I don't want to split it up in 2 files. I have seen the "huge file approach" used by other people, too, and they overcame this problem somehow.
If I use memory-mapping, the first call of reading is always slow. If I then repeat reading from this section, it is lightning fast. When I then read from a different section, it is slow for the first time, but then lightning fast the second time.
I have no idea why this is so.
Does anybody have any more ideas for me?
Thank you.
Disk drives have two (actually three) factors that limit their speed: access time, sequential bandwidth, and bus latency/bandwidth.
What you feel most is access time. Access time is typically in the millisecond ballpark. Having to do a seek takes upwards of 5 (often more than 10) milliseconds on a typical hard disk. Note that the number printed on a disk drive is the "average" time, not the worst time (and, in some cases, it seems to be much closer to "best" than "average").
Sequential read bandwidth is typically upwards of 60-80 MiB/s even for a slow disk, and 120-150 MiB/s for a faster disk (or >400 MiB/s on solid state). Bus bandwidth and latency are something you usually don't care about, as bus speed usually exceeds the drive speed (except if you use a modern solid state disk on SATA-2, or a 15k hard disk on SATA-1, or any disk over USB).
Also note that you cannot change the drive's bandwidth, nor the bus bandwidth. Nor can you change the seek time. However, you can change the number of seeks.
In practice, this means you must avoid seeks as much as you can. If that means reading in data that you do not need, do not be afraid of doing so. It is much faster to read 100 kiB than to read 5 kiB, seek ahead 90 kilobytes, and read another 5 kiB.
If you can, read the whole file in one go, and only use the parts you are interested in. 200 MiB should not be a big hindrance on a modern computer. Reading 200 MiB with fread into an allocated buffer might, however, be prohibitive (that depends on your target architecture and what else your program is doing). But don't worry: you have already found the best solution to the problem - memory mapping.
While memory mapping is not a "magic accelerator", it is nevertheless as close to "magic" as you can get.
The big advantage of memory mapping is that you can directly read from the buffer cache. Which means that the OS will prefetch pages, and you can even ask it to more aggressively prefetch, so effectively all your reads will be "instantaneous". Also, what is stored in the buffer cache is in some sense "free".
Unluckily, memory mapping is not always easy to get right (especially since the documentation and the hint flags typically supplied by operating systems are deceptive or counter-productive).
While you have no guarantee that what has been read once stays in the buffers, in practice this is the case for anything of "reasonable" size. Of course the operating system cannot and will not keep a terabyte of data in RAM, but something around 200 MiB will quite reliably stay in the buffers on a "normal" modern computer. Reading from buffers works more or less in zero time.
So, your goal is to get the operating system to read the file into its buffers, as sequentially as possible. Unless the machine runs out of physical memory so it is forced to discard buffer pages, this will be lightning fast (and if that happens, every other solution will be equally slow).
Linux has the readahead syscall which lets you prefetch data. Unluckily, it blocks until data has been fetched, which is not what you probably want (you would thus have to use an extra thread for this). madvise(MADV_WILLNEED) is a less reliable, but probably better alternative. posix_fadvise may work too, but note that Linux limits the readahead to twice the default readahead size (i.e. 256kiB).
Do not let yourself be fooled by the docs, as the docs are deceptive. It may seem that MADV_RANDOM is a better choice, as your access is "random". It makes sense to be honest to the OS about what you're doing, doesn't it? Usually yes, but not here. This simply turns off prefetching, which is the exact opposite of what you really want. I don't know the rationale behind this, maybe some ill-advised attempt to conserve memory - in any case it is detrimental to your performance.
Windows (since Windows 8, for desktop only) has PrefetchVirtualMemory which does exactly what one would want here, but unluckily it's only available on the newest version. On older versions, there is just... nothing.
A very easy, efficient, and portable way of populating the pages in your mapping is to launch a worker thread that faults every page. This sounds horrendous, but it works very nicely, and is operating-system agnostic.
Something like volatile int x = 0; for(int i = 0; i < len; i += 4096) x += map[i]; is entirely sufficient. I am using such code to pre-fault pages prior to accessing them; it works at speeds unrivalled by any other method of populating buffers and uses very little CPU. A slightly expanded version of the same idea is sketched below.
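Expanded into a self-contained worker (a sketch that assumes a POSIX mmap'ed region map of length len, and 4096-byte pages):

#include <cstddef>
#include <thread>

// Touch one byte per page so the OS faults the whole mapping in
// ahead of the real access pattern.
void prefault(const volatile unsigned char* map, std::size_t len) {
    unsigned sum = 0;                        // keeps the loop from being optimized away
    for (std::size_t i = 0; i < len; i += 4096)
        sum += map[i];
    (void)sum;
}

// Usage: launch in the background while the main thread starts working.
// std::thread t(prefault, map, len); ... t.join();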
(moved to an answer as requested by the OP)
You cannot read from a file any quicker (there is no magic flag to say "read faster"). There is either an issue with your hardware, or 200 ms is how long it is supposed to take.
1) The difference in access speed between your first read and subsequent ones is perfectly understandable: your first call actually reads the file from the disk, and this takes time. However, your kernel (not to mention the disk controller) keeps the accessed data buffered, so when you access it a second time it is a pure memory access (1 ms).
Even if you only need to access really tiny portions of the file, libc/kernel/controller optimizations access the disk in quite large chunks. You can read the libc/OS/controller docs to try to align your reads on these chunks.
2) You're using stream input; try using the direct open/read/close functions instead: low-level I/O has less overhead (obviously). Nothing gets faster than this, so if you still find it too slow, you have an OS or hardware issue.
As it looks like you have a good benchmark, try switching the size and the count in your fread call: reading 1000 bytes once will be faster than reading 1 byte 1000 times.
Disk is slow, and as you pointed out, the delay comes from the first access - that's the disk spinning up and accessing the sectors necessary. You're always going to pay that cost one time.
You could improve your performance a little by using memory mapped IO. See either mmap (Linux) or CreateFileMapping+MapViewOfFile (Windows).
I have already optimized the file so that as few sections as possible have to be read
Correct me if I'm wrong, but in reference to the file being optimised, I'm assuming you mean you've ordered the sections to minimize the number of reads that take place and not what I'm going to suggest.
Being bound by IO here is likely due to the seek times, so other than getting a faster storage medium, your options are limited.
Two possible ideas I had are:
1) Compress the data that is stored, which may give you slightly faster read times but will still not help with seek time. You'd have to test whether this helps at all.
2) If relevant, as soon as you've retrieved one block of data, move it to a thread and start processing it while another read takes place. You may be doing this already, but if not, I thought it worth mentioning.
In my function, I need to read some data from a file into a buffer, manipulate the data and write it back to another file. The file is of unknown size and may be very large.
If I use a small buffer, there will be a long read/write cycle and it will take much time. In contrast, a large buffer means I need to consume more memory. What is the optimal buffer size I should use? Is this case dependent?
I saw some applications like 'Tera copy' in Windows that manage huge files efficiently. Is there any other technique or mechanism I should be aware of?
Note: This program will be running under Windows.
See what Microsoft has to say about IO size: http://technet.microsoft.com/en-us/library/cc938632.aspx. Basically, they say you should probably do IO in 64K blocks.
On *NIX platforms, struct stat has a st_blksize member which says what should be a minimal IO block size.
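A minimal sketch of querying it (POSIX; the file name is hypothetical):

#include <sys/stat.h>
#include <cstdio>

int main() {
    struct stat st;
    if (stat("input.bin", &st) == 0)
        std::printf("preferred I/O block size: %ld bytes\n", (long)st.st_blksize);
    return 0;
}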
It is, indeed, highly case dependent, and you should probably just write your program to be able to handle a flexible buffer size, and then try out what size is optimal.
If you start small and then increase your buffer size, you will probably reach a certain size after which you'll see no or extremely small performance gains, since the CPU is spending most of its time running your code, and the overhead from the I/O has become negligible.
The first rule for these things is to benchmark. My guess would be that you are prematurely optimizing. If you are doing real file IO, the bandwidth of your disk (or whatever) will usually be the bottleneck. As long as you write your data in chunks of several pages, the performance shouldn't change too much.
What you could hope for is to do your computation on parts of the data in parallel to your write operation. For this you would have to keep two buffers: one which is currently being written, and one on which you do the processing. Then you would use asynchronous IO functions (aio_write on POSIX systems; probably something like that exists for Windows, too) and switch buffers for each iteration. A sketch of that pattern follows.
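A minimal sketch of that double-buffering pattern with POSIX AIO (fill() and process() are stand-ins for your data source and computation; error handling omitted; link with -lrt on Linux):

#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

enum { kBuf = 1 << 16 };           // 64 KiB per buffer (an assumption)
char bufs[2][kBuf];

size_t fill(char* p) {             // stand-in: produce the next chunk, 0 at end
    static int calls = 0;
    if (calls++ == 4) return 0;
    std::memset(p, 'x', kBuf);
    return kBuf;
}
void process(char*, size_t) {}     // stand-in: per-chunk computation

int main() {
    int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct aiocb cb;
    std::memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    const struct aiocb* list[] = { &cb };

    int cur = 0;
    off_t off = 0;
    bool pending = false;
    size_t n;
    while ((n = fill(bufs[cur])) > 0) {
        process(bufs[cur], n);     // compute on this buffer...
        if (pending) {             // ...while the previous write is in flight
            aio_suspend(list, 1, nullptr);
            aio_return(&cb);
        }
        cb.aio_buf = bufs[cur];
        cb.aio_nbytes = n;
        cb.aio_offset = off;
        aio_write(&cb);            // start writing this buffer
        pending = true;
        off += n;
        cur ^= 1;                  // switch to the other buffer
    }
    if (pending) { aio_suspend(list, 1, nullptr); aio_return(&cb); }
    close(fd);
    return 0;
}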
Memory management is always case dependent and particularly when combined with file I/O.
There are two possible suggestions from my side.
1) Use a fixed I/O buffer size, e.g. 64 KB, 256 KB, 512 KB or 1 MB. But in this case, when an I/O is larger than this fixed buffer size, you have to track offsets and complete the I/O in multiple iterations.
2) Use a variable I/O buffer size allocated with malloc(), but this also depends on certain factors, such as the available RAM in your system and the maximum dynamic memory allocation limit per process in your OS.
I would suggest you use a buffer the size of a memory page. For example, if the page size is 4K, then you can use a 4 KB buffer size to minimize context switches.
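For reference, the page size can be queried at runtime rather than hard-coded (a POSIX sketch):

#include <unistd.h>
#include <cstdio>

int main() {
    long page = sysconf(_SC_PAGESIZE);  // typically 4096
    std::printf("page size: %ld bytes\n", page);
    return 0;
}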
While I can't speak for the algorithm... memory usage versus processor usage is a classic dilemma in programming, and you should probably choose on a case-by-case basis. So if the system has 4 GB of available RAM you can obviously consume quite a bit, whereas if you only have 512 MB you should consume very little at the cost of exercising the CPU. The best way would be to check and change your size programmatically :)
Let's say I want to write 1 GB of data to a file on, say, an ext3 Linux filesystem using the write(2) syscall, and this happens in a very busy environment (many similar I/Os concurrently). What is the optimal buffer size in the interval, say, [4 kB, 4 MB] to do that when
not using O_DIRECT open flag, or
using O_DIRECT?
Please, no "check it yourself" answers -- I'd like to get some answer from "filesystems" guys.
The answer is, in my experience, much more dependent on the underlying devices and hardware than on the filesystem itself - that is, buffer caches on the device, the capabilities of the device to write in small blocks, etc. However, you should never write in sizes smaller than your file system's block size (stat(".") - likely to be about 4 kB), and you should not really go beyond the L2/L3 cache size of the CPU, which in many cases can be as low as 512 kB.
Given that SSD devices and the like prefer 64 kB as the unit of operations, I would suggest a buffer size of 64-128 kB as the most optimal, which also corresponds with my empirical experience of it having the highest throughput.
As discussed in comments, I believe the exact size doesn't matter that much, assuming it is:
a small multiple of the file system block size (see the comment by Joachim Pileborg suggesting stat(".") etc.)
a power of two (because computers and kernels like them)
not too big (e.g. fitting in some cache inside your processor, e.g. the L2 cache)
aligned in memory (e.g. to a page size using posix_memalign) - see the sketch after this list.
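A minimal sketch combining those points (page-aligned allocation via posix_memalign; the 64 KiB size is just an example in the suggested range):

#include <cstdlib>
#include <unistd.h>

int main() {
    long page = sysconf(_SC_PAGESIZE);   // usually 4096
    const size_t kBuf = 1 << 16;         // 64 KiB: a power of two, cache-friendly
    void* buf = nullptr;
    if (posix_memalign(&buf, (size_t)page, kBuf) != 0)
        return 1;                        // allocation failed
    // ... issue write(2) calls from buf in kBuf-sized chunks ...
    std::free(buf);
    return 0;
}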
So a power of two between 16 kB and a few megabytes should probably fit. Most of the time is spent on reading the disk, and filesystem and disk benchmarks are quite flat in that range.
4 kB seems to often be the page size and the disk chunk size.
Of course, you can tune things - even tune the file system block size when making the file system with mke2fs.
And I'll bet that the optimal is really dependent upon your hardware (SSD, hard disks?) and your system (and its load).
Including stdio.h should define BUFSIZ as the optimal size for the system. This is by no means guaranteed, but it is the right value to use if you do not have the ability to do extensive benchmarks, and it is a good starting point for such benchmarks.
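For example, a trivial sketch (the file name is hypothetical):

#include <cstdio>

int main() {
    static char buf[BUFSIZ];                 // the library's suggested size
    FILE* f = std::fopen("data.bin", "rb");
    if (!f) return 1;
    std::size_t n, total = 0;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        total += n;
    std::fclose(f);
    std::printf("read %zu bytes in BUFSIZ=%d chunks\n", total, (int)BUFSIZ);
    return 0;
}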
I have two options concerning buffer sizes when reading files.
char* buffer = new char[aBlock];
myFile.read(buffer,aBlock);
and,
char* buffer = new char;      // a single-byte buffer
while (!myFile.eof())
    myFile.read(buffer, 1);   // one byte per call
Will there be a considerable time cost difference?
Be aware that as buffer I'm referring to that char* buffer in the code, I'm not talking about the OS file buffers
Yes there will be. In all practical operating systems, the price of doing I/O is quite a bit higher than the cost of the memory given over to buffering. Practical buffer sizes are much smaller than might be expected. The C runtime library's default buffer size of 512 bytes for a FILE* is pretty good - really good, in fact, for the multitude of situations it is used in. And that was developed for a 65,536-byte memory space on Unix V6 (c. 1978).
Carefully measuring throughput, CPU load, and overall system load to optimize buffer size has always led me to choose a buffer size in the range of 1024 to 16384 bytes. The only exception is for files a little larger than that range, in which case it is optimal to hold the whole file in memory when memory is available.
You will already be reading buffered, as the call does not read directly from the file into the buffer you pass to read(), but first goes through the buffer inside the fstream (a filebuf).
The first sequence will still be quicker because it will loop fewer times, but not as drastic as people may think because the file I/O itself won't be any slower.
You can change the internal buffer of the fstream but that is a more complex issue.
The runtime library will buffer for you. The extra cost here is mostly in executing the same instructions over and over to read one byte versus blocked reads. Number of calls is aBlock times greater if you read one byte at a time.
It's much faster to use a large buffer, because the drive will then be able to read eg. an entire cylinder at once, instead of reading a sector, then waiting for the next request to read another. Also, there's a lot of overhead involved in each request, like getting access to the bus and setting the dma controller.
Just take care not to use so much memory that you'll need to swap data out to the disk, which will slow things down a lot.
There is actually a huge difference, depending on the implementation. In Visual C++ 2008, there is a critical section that is entered on each call to read(). So the second set of code in the question will enter "aBlock" times more critical sections than the first set of code.