AES memory efficiency - c++

I'm writing a C++ program for AES encryption in CTR mode, but my question doesn't require knowledge of either.
I'm wondering how much of a file I should buffer at a time to encrypt and write out to the new encrypted file. I ask because I know disk reads are quite expensive, so it seems to make sense, if possible, to read and buffer the entire original file, encrypt it, and write it to the new file. However, if the file is 1 GB, I don't want to reserve a whole 1 GB of main memory for the duration of the encryption.
So I'm curious what the optimal buffer size is. For example, buffering 100 MB and performing 10 iterations of encryption to process the entire 1 GB file. Thanks.

Memory map the file and let the system figure out the right buffer size.
Usually the file is buffered into main memory anyway (on server and desktop systems). So the buffer size in your application can be kept relatively small. 1 MiB would be plenty and would probably not matter much on any system with 1 GiB of main memory or more.
On embedded systems that do not buffer file I/O, it may take some figuring out what is happening underneath and how much memory can be spared. I would consider a buffer of about 1-8 KiB a good minimum; if you go lower than that, you may want to time the AES operations as well.
To make sure you can optimize later on, you may want to make the buffer a multiple of 64 bytes (the block size of AES is 16 bytes and that of SHA-256 is 64 bytes). In general, try to keep to full powers of two or as close to that as possible (one MiB is 2^20 bytes).
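To illustrate the chunked approach, here is a minimal sketch that reads, encrypts, and writes 1 MiB at a time. encrypt_ctr is a hypothetical stand-in for whatever AES-CTR routine you are using; since CTR is a stream mode, processing the file in chunks is fine as long as the counter state carries over between calls.
#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical placeholder: encrypt `len` bytes in place with AES-CTR,
// keeping the counter state between calls. Replace with your real routine.
void encrypt_ctr(unsigned char* data, std::size_t len) { (void)data; (void)len; }

void encrypt_file(const char* inPath, const char* outPath)
{
    constexpr std::size_t kBufSize = 1 << 20;   // 1 MiB, a multiple of the 16-byte AES block
    std::vector<unsigned char> buffer(kBufSize);

    std::ifstream in(inPath, std::ios::binary);
    std::ofstream out(outPath, std::ios::binary);

    while (in) {
        in.read(reinterpret_cast<char*>(buffer.data()),
                static_cast<std::streamsize>(buffer.size()));
        std::streamsize got = in.gcount();      // the last chunk is usually short
        if (got <= 0) break;
        encrypt_ctr(buffer.data(), static_cast<std::size_t>(got));
        out.write(reinterpret_cast<const char*>(buffer.data()), got);
    }
}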

Who's telling you that "disk reads are quite expensive"? Unless you're processing terabytes of data, the cost of I/O is going to be so inconsequential you'll have a hard time measuring it. A 1 MB buffer will be far more than you need; I bet you'd have a hard time finding a measurable difference between 64 KB and 1 MB or more.
The one exception is if you're reading a lot of data off a really slow device, something like a NAS drive on a congested network, but even then I'd consider any effort to implement buffering a false optimization. In that case, copy the data to a local drive and process it from local storage.
C++ buffers input and output with reasonable defaults anyway, plus most operating systems will fetch blocks of data as you're reading sequentially in order to make retrieval efficient. Unless you have a very compelling reason, stick with the normal behaviour. There should be no need to write custom buffering code.

Related

Optimal way of writing to append-only files on an SSD

I want to know what's the optimal way to log to an SSD. Think of something like a database log, where you're writing append-only, but you also have to fsync() every transaction or few transactions to ensure application level data durability.
I'm going to give some background on how SSDs work, so if you already know all this, please skim it anyway in case I am wrong about something. Some good further reading is Emmanuel Goossaert's 6-part guide to coding for SSDs and the paper Don't Stack your Log on my Log [pdf].
SSDs write and read in whole pages only. The page size differs from SSD to SSD but is typically a multiple of 4 KB. My Samsung EVO 840 uses an 8 KB page size (which, incidentally, Linus calls "unusable shit" in his usual colorful manner). SSDs cannot modify data in place; they can only write to free pages. Combining those two restrictions, updating a single byte on my EVO requires reading the 8 KB page, changing the byte, writing it to a new 8 KB page, and updating the FTL page mapping (an SSD data structure) so the logical address of that page as understood by the OS now points to the new physical page. Because the file data is also no longer contiguous within the same erase block (the smallest group of pages that can be erased), we are also building up a form of fragmentation debt that will cost us in future garbage collection inside the SSD. Horribly inefficient.
As an aside, looking at my PC filesystem (C:\WINDOWS\system32>fsutil fsinfo ntfsinfo c:): it has a 512-byte sector size and a 4 KB allocation (cluster) size. Neither of these maps to the SSD page size, which is probably not very efficient.
There are some issues with just writing with e.g. pwrite() to the kernel page cache and letting the OS handle writing things out. First, you need to issue an additional sync_file_range() call after calling pwrite() to actually kick off the I/O; otherwise it all waits until you call fsync() and unleashes an I/O storm. Second, fsync() seems to block future calls to write() on the same file. Lastly, you have no control over how the kernel writes things to the SSD, which it may do well, or it may do poorly, causing a lot of write amplification.
Because of the above reasons, and because I need AIO for reads of the log anyway, I'm opting for writing to the log with O_DIRECT and O_DSYNC and having full control.
As I understand it, O_DIRECT requires all writes to be aligned to sector size and in whole numbers of sectors. So every time I decide to issue an append to the log, I need to add some padding to the end to bring it up to a whole number of sectors (if all writes are always a whole number of sectors, they will also be correctly aligned, at least in my code.) Ok, that's not so bad. But my question is, wouldn't it be better to round up to a whole number of SSD pages instead of sectors? Presumably that would eliminate write amplification?
That could burn a huge amount of space, especially if writing small amounts of data to the log at a time (e.g. a couple hundred bytes). It also may be unnecessary. SSDs like the Samsung EVO have a write cache, and they don't flush it on fsync(); instead they rely on capacitors to write the cache out to flash in the event of a power loss. In that case, maybe the SSD does the right thing with an append-only log being written a few sectors at a time: it may not write out the final partial page until the next append(s) arrive and complete it (or unless it is forced out of the cache by large amounts of unrelated I/O). Since the answer likely varies by device and maybe filesystem, is there a way I can code up the two possibilities and test my theory? Some way to measure write amplification or the number of updated/RMW pages on Linux?
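For concreteness, the padded, sector-aligned append I have in mind looks roughly like this (a sketch only; it assumes Linux, a 512-byte logical sector size, a file descriptor already opened with O_DIRECT | O_DSYNC, and an offset that is itself sector-aligned):
#include <cstdlib>
#include <cstring>
#include <unistd.h>

constexpr size_t kSector = 512;   // assumed logical sector size; query the device in real code

// Append `len` bytes at `offset`, padding the write up to a whole number of sectors.
// Returns the number of bytes written (including padding), or -1 on error.
ssize_t append_padded(int fd, off_t offset, const void* data, size_t len)
{
    size_t padded = ((len + kSector - 1) / kSector) * kSector;

    void* buf = nullptr;
    if (posix_memalign(&buf, kSector, padded) != 0)   // O_DIRECT also wants an aligned buffer
        return -1;
    std::memset(buf, 0, padded);                      // zero-fill the padding bytes
    std::memcpy(buf, data, len);

    ssize_t written = pwrite(fd, buf, padded, offset);
    std::free(buf);
    return written;
}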
I will try to answer your question, as I had the same task but on SD cards, which are still flash memory.
Short Answer
You can only write a full page of 512 bytes at a time in flash memory. Because flash memory has a limited write-cycle count, the controller chip buffers and randomizes writes to improve the drive's lifetime.
To write a single bit in flash memory, you must first erase the entire page (512 bytes) where it sits. So if you want to append or modify one byte somewhere, the whole page it resides in has to be erased first.
The process can be summarized as:
Read the whole page to a buffer
Modify the buffer with your added content
Erase the whole page
Rewrite the whole page with the modified buffer
Long Answer
The sector (page) size comes down to the flash hardware and the physical flash driver, over which you have no control. That page has to be erased and rewritten each time you change something.
As you probably already know, you cannot rewrite a single bit in a page without erasing and rewriting the entire 512 bytes. Flash has a write-cycle life of about 100,000 cycles before a sector can be damaged. To improve lifetime, the physical driver, and sometimes the system, uses a write-randomization (wear-leveling) algorithm to avoid always writing the same sector. (By the way, never defragment an SSD; it is useless and at best reduces the lifetime.)
The cluster, on the other hand, is handled at a higher level, by the file system, and this you do have control over. Usually, when you format a new drive, you can select the cluster size, which on Windows is the Allocation Unit Size in the format dialog.
Most file systems I know of work with an index located at the beginning of the disk. This index keeps track of each cluster and what is assigned to it. This means a file occupies at least one cluster, even if it is much smaller.
The trade-off: the smaller your cluster size, the bigger your index table will be and the more space it will occupy, but if you have a lot of small files you will waste less space.
On the other hand, if you only store big files, you can select a large cluster size, just slightly larger than your typical file size.
Since your task is logging, I would recommend logging into a single, huge file with a big cluster size. Having experimented with this type of logging, I can say that having a large number of files in a single folder can cause issues, especially on embedded devices.
Implementation
Now, if you have raw access to the drive and really want to optimize, you can write directly to the disk without using the file system.
On the upside
* Will save you quite some disk space
* Will make the disk tolerant of failures, if your design is smart enough
* Will require far fewer resources if you are on a limited system
On the downside
* Much more work and debug
* The drive won't be natively recognized by the system.
If you only log, you don't need a file system; you just need an entry point to the page where you write your data, which advances continuously.
The implementation I did on an SD card was to reserve 100 pages at the beginning of the flash to store the current write and read locations. This information fits in a single page, but to avoid wear issues I wrote it sequentially in a circular fashion over the 100 pages, with an algorithm to work out which page holds the most recent information.
The position information was written every 5 minutes or so, which means that in case of a power outage I would lose only the last 5 minutes of the log. It is also possible, starting from the last recorded write location, to check whether the following sectors contain valid data before writing further.
This provided a very robust solution, as table corruption is much less likely.
I would also suggest buffering 512 bytes and writing page by page.
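A compact sketch of that page-by-page buffering idea follows; write_page is a hypothetical stand-in for however you write one raw 512-byte page to the card.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kPage = 512;

// Hypothetical: write one full 512-byte page at the given page index (stub here).
void write_page(std::uint32_t pageIndex, const std::uint8_t* page) { (void)pageIndex; (void)page; }

class PageLogger {
public:
    void append(const std::uint8_t* data, std::size_t len) {
        while (len > 0) {
            std::size_t n = std::min(len, kPage - fill_);
            std::memcpy(buf_ + fill_, data, n);
            fill_ += n; data += n; len -= n;
            if (fill_ == kPage) flush();          // only ever write whole pages
        }
    }
    void flush() {                                // pads a partial page with zeros
        if (fill_ == 0) return;
        std::memset(buf_ + fill_, 0, kPage - fill_);
        write_page(nextPage_++, buf_);
        fill_ = 0;
    }
private:
    std::uint8_t  buf_[kPage] = {};
    std::size_t   fill_ = 0;
    std::uint32_t nextPage_ = 0;
};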
Others
You may also want to check out log-specific file systems; they might simply do the job for you: Log-structured file system

C++ Reading from several sections of a file is too slow

I need to read byte arrays from several locations of a big file.
I have already optimized the file so that as few sections as possible have to be read, and the sections are as closely together as possible.
I have 20 calls like this one:
m_content.resize(iByteCount);
fseek(iReadFile,iStartPos ,SEEK_SET);
size_t readElements = fread(&m_content[0], sizeof(unsigned char), iByteCount, iReadFile);
iByteCount is around 5000 on average.
Before using fread, I used a memory-mapped file, but the results were approximately the same.
My calls are still too slow (around 200 ms) when called for the first time. When I repeat the same call with the same sections of bytes to read, it is very fast (around 1 ms), but that does not really help me.
The file is big (around 200 MB).
After this call, I have to read double values from a different section of the file, but I can not avoid this.
I don't want to split it up in 2 files. I have seen the "huge file approach" used by other people, too, and they overcame this problem somehow.
If I use memory mapping, the first read is always slow. If I then repeat reading from this section, it is lightning fast. When I then read from a different section, it is slow the first time, but then lightning fast the second time.
I have no idea why this is so.
Does anybody have any more ideas for me?
Thank you.
Disk drives have two (actually three) factors that limit their speed: access time, sequential bandwidth, and bus latency/bandwidth.
What you feel most is access time. Access time is typically in the millisecond ballpark. A seek takes upwards of 5 (often more than 10) milliseconds on a typical hard disk. Note that the number printed on a disk drive is the "average" time, not the worst time (and, in some cases, it seems to be much closer to "best" than "average").
Sequential read bandwidth is typically upwards of 60-80 MiB/s even for a slow disk, and 120-150 MiB/s for a faster disk (or >400 MiB/s on solid state). Bus bandwidth and latency are something you usually don't care about, as bus speed usually exceeds drive speed (except if you use a modern solid-state disk on SATA-2, or a 15k hard disk on SATA-1, or any disk over USB).
Also note that you cannot change the drive's bandwidth, nor the bus bandwidth. Nor can you change the seek time. However, you can change the number of seeks.
In practice, this means you must avoid seeks as much as you can. If that means reading in data that you do not need, do not be afraid of doing so. It is much faster to read 100 kiB than to read 5 kiB, seek ahead 90 kilobytes, and read another 5 kiB.
If you can, read the whole file in one go, and only use the parts you are interested in. 200 MiB should not be a big hindrance on a modern computer. Reading in 200 MiB with fread into an allocated buffer might however be forbidding (that depends on your target architecture, and what else your program is doing). But don't worry, you have already had the best solution to the problem: memory mapping.
While memory mapping is not a "magic accelerator", it is nevertheless as close to "magic" as you can get.
The big advantage of memory mapping is that you can directly read from the buffer cache. Which means that the OS will prefetch pages, and you can even ask it to more aggressively prefetch, so effectively all your reads will be "instantaneous". Also, what is stored in the buffer cache is in some sense "free".
Unluckily, memory mapping is not always easy to get right (especially since the documentation and the hint flags typically supplied by operating systems are deceptive or counter-productive).
While you have no guarantee that what has been read once stays in the buffers, in practice this is the case for anything of "reasonable" size. Of course the operating system cannot and will not keep a terabyte of data in RAM, but something around 200 MiB will quite reliably stay in the buffers on a "normal" modern computer. Reading from buffers works more or less in zero time.
So, your goal is to get the operating system to read the file into its buffers, as sequentially as possible. Unless the machine runs out of physical memory so it is forced to discard buffer pages, this will be lightning fast (and if that happens, every other solution will be equally slow).
Linux has the readahead syscall which lets you prefetch data. Unluckily, it blocks until data has been fetched, which is not what you probably want (you would thus have to use an extra thread for this). madvise(MADV_WILLNEED) is a less reliable, but probably better alternative. posix_fadvise may work too, but note that Linux limits the readahead to twice the default readahead size (i.e. 256kiB).
Do not let yourself be fooled by the docs, as the docs are deceptive. It may seem that MADV_RANDOM is a better choice, as your access is "random". It makes sense to be honest with the OS about what you're doing, doesn't it? Usually yes, but not here. MADV_RANDOM simply turns off prefetching, which is the exact opposite of what you really want. I don't know the rationale behind this, maybe some ill-advised attempt to conserve memory; in any case it is detrimental to your performance.
Windows (since Windows 8, for desktop only) has PrefetchVirtualMemory which does exactly what one would want here, but unluckily it's only available on the newest version. On older versions, there is just... nothing.
A very easy, efficient, and portable way of populating the pages in your mapping is to launch a worker thread that faults every page. This sounds horrendous, but it works very nicely, and is operating-system agnostic.
Something like volatile int x = 0; for(int i = 0; i < len; i += 4096) x += map[i]; is entirely sufficient. I am using such code to pre-fault pages prior to accessing them, it works at speeds unrivalled to any other method of populating buffers and uses very little CPU.
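Fleshed out a little, the worker-thread version might look like the sketch below (assuming a 4096-byte page size; this is just an illustration of the idea, not production code):
#include <cstddef>
#include <thread>

// Touch one byte per page so the OS faults the mapping in,
// while the main thread carries on with real work.
void prefault_async(const volatile unsigned char* map, std::size_t len)
{
    std::thread([map, len] {
        volatile unsigned long sink = 0;
        for (std::size_t i = 0; i < len; i += 4096)   // 4096 = assumed page size
            sink += map[i];
        (void)sink;
    }).detach();
}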
(moved to an answer as requested by the OP)
You cannot read from a file any quicker (there is no magic flag that says "read faster"). Either there is an issue with your hardware, or 200 ms is simply how long it is supposed to take.
1) The difference in access speed between your first read and subsequent ones is perfectly understandable: your first call actually reads the file from disk, and this takes time. However, your kernel (not to mention the disk controller) keeps the accessed data buffered, so when you access it a second time it is a pure memory access (1 ms).
Even if you only need to access really tiny portions of the file, libc/kernel/controller optimizations access the disk in quite large chunks. You can read the libc/OS/controller docs to try to align your reads on these chunks.
2) You're using stream input; try using the direct open/read/close functions: low-level I/O has less overhead (obviously). Nothing gets faster than this, so if you still find it too slow, you have an OS or hardware issue.
Since it looks like you have a good benchmark, try switching the size and the count in your fread call: reading 1 element of 1000 bytes will be faster than 1000 elements of 1 byte.
Disk is slow, and as you pointed out, the delay comes from the first access - that's the disk spinning up and accessing the sectors necessary. You're always going to pay that cost one time.
You could improve your performance a little by using memory mapped IO. See either mmap (Linux) or CreateFileMapping+MapViewOfFile (Windows).
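A minimal POSIX sketch of the mmap route (error handling pared down; the Windows path would use the CreateFileMapping/MapViewOfFile pair mentioned above):
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a whole file read-only; returns nullptr on failure.
const unsigned char* map_whole_file(const char* path, std::size_t* outLen)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                  // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;

    madvise(p, st.st_size, MADV_WILLNEED);      // hint: prefetch, as discussed above
    *outLen = static_cast<std::size_t>(st.st_size);
    return static_cast<const unsigned char*>(p);
}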
I have already optimized the file so that as few sections as possible have to be read
Correct me if I'm wrong, but in reference to the file being optimised, I'm assuming you mean you've ordered the sections to minimize the number of reads that take place and not what I'm going to suggest.
Being bound by IO here is likely due to the seek times, so other than getting a faster storage medium, your options are limited.
Two possible ideas I had are:
1) Compress the stored data, which may give you slightly faster read times but will still not help with seek time. You'd have to test whether it helps at all.
2) If relevant, as soon as you've retrieved one block of data, move it to a thread and start processing it while another read takes place. You may be doing this already, but if not, I thought it worth mentioning.

What would be an ideal buffer size? [duplicate]

Possible Duplicate:
How do you determine the ideal buffer size when using FileInputStream?
When reading raw data from a file (or any input stream) using either the C++ istream family's read() or C's fread(), a buffer has to be supplied along with a count of how much data to read. Most programs I have seen seem to arbitrarily choose a power of 2 between 512 and 4096.
Is there a reason it has to/should be a power of 2, or is this just programmers' natural inclination towards powers of 2?
What would be the "ideal" number? By "ideal" I mean that it would be the fastest. I assume it would have to be a multiple of the underlying device's buffer size? Or maybe of the underlying stream object's buffer? How would I determine what the size of those buffers is, anyway? And once I do, would using a multiple of it give any speed increase over just using the exact size?
EDIT
Most answers seem to be that it can't be determined at compile time. I am fine with finding it at runtime.
SOURCE:
How do you determine the ideal buffer size when using FileInputStream?
Optimum buffer size is related to a number of things: file system block size, CPU cache size, and cache latency.
Most file systems are configured to use block sizes of 4096 or 8192. In theory, if you configure your buffer size so you are reading a few bytes more than the disk block, the operations with the file system can be extremely inefficient (i.e. if you configured your buffer to read 4100 bytes at a time, each read would require 2 block reads by the file system). If the blocks are already in cache, then you wind up paying the price of RAM -> L3/L2 cache latency. If you are unlucky and the blocks are not in cache yet, then you pay the price of the disk -> RAM latency as well.
This is why you see most buffers sized as a power of 2, and generally larger than (or equal to) the disk block size. This means that one of your stream reads could result in multiple disk block reads, but those reads will always use a full block: no wasted reads.
Ensuring this also typically results in other performance friendly parameters affecting both reading and subsequent processing: data bus width alignment, DMA alignment, memory cache line alignment, whole number of virtual memory pages.
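Since the OP is fine with determining the value at runtime: on POSIX systems, stat() reports a preferred I/O block size per file, which is a reasonable starting point. A sketch (Windows has analogous queries such as GetDiskFreeSpace for sector/cluster sizes):
#include <cstddef>
#include <sys/stat.h>

// Returns the filesystem's preferred I/O size for this file,
// falling back to 4096 if the call fails.
std::size_t preferred_block_size(const char* path)
{
    struct stat st;
    if (stat(path, &st) == 0 && st.st_blksize > 0)
        return static_cast<std::size_t>(st.st_blksize);
    return 4096;
}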
At least in my case, the assumption is that the underlying system is using a buffer whose size is a power of two, too, so it's best to try and match. I think nowadays buffers should be made a bit bigger than what "most" programmers tend to make them. I'd go with 32 KB rather than 4, for instance.
It's very hard to know in advance, unfortunately. It depends on whether your application is I/O or CPU bound, for instance.
I think that mostly it's just choosing a "round" number. If computers worked in decimal we'd probably choose 1000 or 10000 instead of 1024 or 8192. There is no very good reason.
One possible reason is that disk sectors are usually 512 bytes in size so reading a multiple of that is more efficient, assuming that all the hardware layers and caching cause the low level code to actually be able to use this fact efficiently. Which it probably can't unless you are writing a device driver or doing an unbuffered read.
No reason that I know of that it has to be a power of two. You are constrained by the buffer size having to be within max size_t but this is unlikely to be an issue.
Clearly the bigger the buffer the better but this is obviously not scalable so some account must be taken of system resource considerations either at compile time or preferably at runtime.
1. Is there a reason it has to/should be a power of 2, or is this just programmers' natural inclination towards powers of 2?
Not really. It should probably be something that divides evenly by the data bus width to simplify memory copies, so anything that is a multiple of 16 would be fine with current technology. Using a power of 2 makes it likely that it will work well with any future technology.
2. What would be the "ideal" number? By "ideal" I mean that it would be the fastest.
The fastest would be as much as possible. However, once you go over a few kilobytes you will have a very small performance difference compared to the amount of memory that you use.
I assume it would have to be a multiple of the underlying device's buffer size? Or maybe of the underlying stream object's buffer? How would I determine what the size of those buffers is, anyway?
You can't really know the size of the underlying buffers, or depend on them remaining the same.
And once I do, would using a multiple of it give any speed increase over just using the exact size?
Some, but very little.
I think the ideal buffer size is the size of one block on your hard drive, so that it maps properly to your buffer when storing or fetching data from the hard drive.

Will buffer size make a difference in time cost while reading files?

I have two options concerning buffer sizes when reading files.
char* buffer = new char[aBlock];
myFile.read(buffer,aBlock);
and,
char* buffer = new char;
while (!myFile.eof())
myFile.read(buffer,1);
Will there be a considerable time cost difference?
Be aware that by "buffer" I'm referring to the char* buffer in the code; I'm not talking about the OS file buffers.
Yes, there will be. In all practical operating systems, the price of doing I/O is quite a bit higher than the trade-off of having less memory available. Practical buffer sizes are much smaller than might be expected. The C runtime library's default buffer size of 512 bytes for a FILE * is pretty good, really good in fact, for the multitude of situations in which it is used. And that was developed for a 65,536-byte memory space on Unix V6 (c. 1978).
Carefully measuring throughput, CPU load, and overall system load to optimize buffer size has always led me to choose a buffer size in the range of 1024 to 16384 bytes. The only exception is for files a little larger than that range, in which case it is optimal to hold the whole file in memory when memory is available.
Your reads are already buffered: the data does not go directly from the file to the buffer you pass to read(), but first goes through the buffer inside the fstream (a filebuf).
The first sequence will still be quicker because it loops fewer times, but the difference is not as drastic as people may think, because the file I/O itself won't be any slower.
You can change the internal buffer of the fstream but that is a more complex issue.
The runtime library will buffer for you. The extra cost here is mostly in executing the same instructions over and over to read one byte versus blocked reads. Number of calls is aBlock times greater if you read one byte at a time.
It's much faster to use a large buffer, because the drive will then be able to read, e.g., an entire cylinder at once, instead of reading one sector and then waiting for the next request to read another. Also, there's a lot of overhead involved in each request, like getting access to the bus and setting up the DMA controller.
Just take care not to use so much memory that you'll need to swap data out to disk, which will slow things down a lot.
There is actually a huge difference, depending on the implementation. In Visual C++ 2008, there is a critical section that is entered on each call to read(). So the second set of code in the question will enter "aBlock" times as many critical sections as the first set of code.
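If you want to put numbers on this for your own files, a quick (and unscientific) timing sketch along these lines is enough; the file path is just a placeholder:
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    const char* path = "test.bin";            // placeholder: any reasonably large file
    const std::size_t aBlock = 1 << 16;       // 64 KiB

    auto timeIt = [](auto&& readAll) {
        auto t0 = std::chrono::steady_clock::now();
        readAll();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    };

    std::vector<char> buf(aBlock);
    double blockSecs = timeIt([&] {
        std::ifstream f(path, std::ios::binary);
        while (f.read(buf.data(), buf.size()) || f.gcount() > 0) {}   // block reads
    });
    double byteSecs = timeIt([&] {
        std::ifstream f(path, std::ios::binary);
        char c;
        while (f.read(&c, 1)) {}                                      // one byte at a time
    });

    std::cout << "block reads: " << blockSecs << " s, byte reads: " << byteSecs << " s\n";
}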

What is the ideal memory block size to use when copying?

I am currently using 100 megabytes per memory block to copy large files.
Is there a "good" amount that people normally use?
Edit
Thanks for all the great responses.
I'm still quite new to these concepts so I'll try to understand a lot of the ones that have been said (e.g. write back cache). I keep learning new things :)
A block between 4096 and 32KB is the typical choice. Using 100MB is counter-productive. You are occupying RAM with the buffer that can be put to much better use as the file system writeback cache.
Copying files is very fast when the file fits completely in the cache, the WriteFile() call is a simple memory-to-memory copy. The cache manager then lazily writes it out to the disk. But when there's no more room in the cache, the copy speed drops off a cliff when WriteFile() has to wait for space to be made available. It now goes at disk write speeds.
I would recommend you to benchmark this, and remember to include much smaller block sizes. In my own tests on this, I got quite counterintuitive results.
When reading and writing from the hard drive, all (power of two) block sizes between 512 byte and 512 kB gave the same speed. Increasing the block size from 512 kB to 1 MB reduced the copying speed to about 60%. Increasing the block size further increased the speed again, but never all the way back to the speed of using small blocks.
When all the copied data was in the cache memory, the (much faster) copying speed improved with increasing block sizes, flattening out at around 32 kB blocks, and then suddenly dropped to about half the previous speed when going from 256 kB to 512 kB blocks, never to return to the previous speeds.
After this test, I dropped read/write block sizes in several of my programs from around 1 MB to 32 kB.
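If you want to reproduce that kind of benchmark yourself, a bare-bones copy routine with a configurable block size is all it takes (a sketch; timing and error handling left out, and block sizes worth trying range from a few KiB up to a few MiB):
#include <cstddef>
#include <fstream>
#include <vector>

// Copy src to dst using read/write chunks of blockSize bytes.
void copy_file_blocked(const char* src, const char* dst, std::size_t blockSize)
{
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary);
    std::vector<char> buf(blockSize);

    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = in.gcount();   // last chunk may be short
        if (got <= 0) break;
        out.write(buf.data(), got);
    }
}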
There's generally little benefit in using blocks that large.
Suppose your operating system is super-naive and every read or write operation incurs a hard disc seek (in practice you will often find that writes get queued and reads get read-ahead-buffered, reducing the benefit of using large buffers in your application code).
Then every block costs you (say) 2x10ms for two seeks (one to read and one to write) and there's little point increasing your block size once the time for the actual reading and writing is substantially more than that. A really fast HD might read and write at 150MB/s, in which case that 10ms would correspond to 1.5MB of reading/writing, and you'd be gaining little for blocksizes beyond 15MB.
In practice, (1) your seek time will probably be less, (2) your read and write bandwidth will probably be more, and (3) your OS and drive hardware will probably be caching and queuing things for you; you'll probably see little or no benefit from block sizes above about 100 KB.
(You should probably benchmark a variety of blocksizes and see what you get on your own system.)
That's a pretty excessive amount. Consider that you don't even start writing data before reading 100 MB, so the filesystem driver doesn't even have an opportunity to write any of the destination file while you're reading. The disk could be writing parts of the file that happen to pass under the head as it's reading the source file (see elevator seek for example).
I think it depends on the amount of free memory you have.
If you use 100 MB blocks to copy on a machine that has, for example, only 30 MB of free memory, it will take much longer than using a smaller (20 MB) block.
If your copy buffer is larger than the available free memory, then due to virtual-memory swapping your copy will be slower than expected.
Given that the drive must seek when it changes tracks, might not a block size of say 63 x 512 = 32256 produce optimum results?