What could be the cause for this performance difference? - c++

I'm constructing a cache file (~70 MB for testing) meant for spinning drives. There is a lot of random IO involved since I'm sorting things into it; this is somewhat alleviated by caching sequential items, but I also have memory constraints.
Anyway, the difference appears between when I:
a) freshly create the file and write it full of data ~100s
b) open the same file and write it full of data ~30s
I'm using memory-mapped files to access them; when I freshly create a file, I preallocate it, of course. I verified all the data; it's accurate.
The data I'm writing is slightly different each time (roughly a 5% difference, evenly distributed throughout). Could it be that when I write to a memory-mapped file and overwrite something with the same data, it doesn't consider the page dirty and thus doesn't actually write anything at all? But how could it know?
Or perhaps there is some kind of write caching going on by windows or the hardware?

Try to trace the page faults, or at least monitor them with Process Explorer for each write phase.
Anyway, when you open the same file with write access, the file is "recreated", but the existing mapped pages are kept as-is in memory. Then, during the writing, if the data is binary-identical across a whole page (usually 4 KB per page, so statistically this can happen with your data), the page will not be flagged as "updated". So when the file is closed, no flush occurs for some pages; that's why you see a big difference in performance.

Related

Memory mapped IO concept details

I'm attempting to figure out what the best way is to write files in Windows. For that, I've been running some tests with memory mapping, in an attempt to figure out what is happening and how I should organize things...
Scenario: The file is intended to be used in a single process, in multiple threads. You should see a thread as a worker that works on the file storage; some of them will read, some will write - and in some cases the file will grow. I want my state to survive both process and OS crashes. Files can be large, say: 1 TB.
After reading a lot on MSDN, I whipped up a small test case. What I basically do is the following:
Open a file (CreateFile) using FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH.
Build a mmap file handle (CreateFileMapping) on the file, using some file growth mechanism.
Map the memory regions (MapViewOfFile) using a multiple of the sector size (from STORAGE_PROPERTY_QUERY). The mode I intend to use is READ+WRITE.
So far I've been unable to figure out how to use these construct exactly (tools like diskmon won't work for good reasons) so I decided to ask here. What I basically want to know is: how I can best use these constructs for my scenario?
If I understand correctly, this is more or less the correct approach; however, I'm unsure as to the exact role of CreateFileMapping vs MapViewOfFile and if this will work in multiple threads (e.g. the way writes are ordered when they are flushed to disk).
I intend to open the file once per process as per (1).
Per thread, I intend to create a mmap file handle as per (2) for the entire file. If I need to grow the file, I will estimate how much space I need, close the handle and reopen it using CreateFileMapping.
While the worker is doing its thing, it needs pieces of the file. So, I intend to use MapViewOfFile (which seems limited to 2 GB) for each piece, process it and unmap it again.
Questions:
Do I understand the concepts correctly?
When is data physically read and written to disk? So, when I have a loop that writes 1 MB of data in (3), will it write that data after the unmap call? Or will it write data the moment I hit memory in another page? (After all, disks are block devices so at some point we have to write a block...)
Will this work in multiple threads? This is about the calls themselves - I'm not sure if they will error if you have -say- 100 workers.
I do understand that (written) data is immediately available in other threads (unless it's a remote file), which means I should be careful with read/write concurrency. If I intend to write stuff, and afterwards update a (single-physical-block) header indicating that readers should use another pointer from now on - then is it guaranteed that the data is written prior to the header?
Will it matter if I use 1 file or multiple files (assuming they're on the same physical device of course)?
Memory mapped files generally work best for READING; not writing. The problem you face is that you have to know the size of the file before you do the mapping.
You say:
in some cases the file will grow
Which really rules out a memory mapped file.
When you create a memory mapped file on Windoze, you are creating your own page file and mapping a range of memory to that page file. This tends to be the fastest way to read binary data, especially if the file is contiguous.
For writing, memory mapped files are problematic.

Extreme performance difference, when reading same files a second time with C

I have to read binary data into char arrays from large (2 GB) binary files in a C++ program. When reading the files for the first time from my SSD, reading takes about 6.4 seconds per file. But when running the same code again, or even after running a different dummy program which does almost the same thing beforehand, the subsequent readings take only about 1.4 seconds per file. The Windows Task Manager even shows much less disk activity on the second, third, fourth… run. So my guess is that Windows' file caching is sparing me from waiting for data from the SSD when filling the arrays another time.
Is there any clean option to read the files into the file cache before the customer runs the software? Any better option than just loading the files with fread in advance? And how can I make sure the data remains in the file cache until I need it?
Or am I totally wrong with my File Cache assumption? Is there another (better) explanation for these different loading times?
Educated guess here:
You most likely are right with your file cache assumption.
Can you pre load files before the user runs the software?
Not directly. How would your program be supposed to know that it is going to be run in the next few minutes?
So you probably need a helper mechanism or tricks.
The options I see here are:
Indexing mechanisms to provide faster and better-aimed access to your data. This is helpful if you only need small chunks of the data at a time.
Attempt to parallelize the loading of the data, so even if it does not actually get faster, the user has the impression it does, because they can start working with the data they already have while the rest is fetched in the background.
Have a helper tool start up with the OS and pre-fetch everything, so you already have it in memory when required. Caution: this has serious implications, since you reserve either a large chunk of RAM or even SSD cache (depending on implementation) for your tool from the start. Only consider doing this if the alternative is the apocalypse…
You can also try to combine the first two options. The key to faster data availability is figuring out what to read in which order, instead of trying to load everything at once en bloc. Divide and conquer.
Without further details on the problem it is impossible to provide more specific solutions though.

Optimal way of writing to append-only files on an SSD

I want to know what's the optimal way to log to an SSD. Think of something like a database log, where you're writing append-only, but you also have to fsync() every transaction or few transactions to ensure application level data durability.
I'm going to give some background on how SSDs work, so if you already know all this, please skim it anyway in case I am wrong about something. Some good further reading is Emmanuel Goossaert's six-part guide to coding for SSDs and the paper Don't Stack Your Log on My Log [pdf].
SSDs write and read in whole pages only, where the page size differs from SSD to SSD but is typically a multiple of 4 KB. My Samsung EVO 840 uses an 8 KB page size (which, incidentally, Linus calls "unusable shit" in his usual colorful manner). SSDs cannot modify data in place; they can only write to free pages. Combining those two restrictions, updating a single byte on my EVO requires reading the 8 KB page, changing the byte, writing it to a new 8 KB page, and updating the FTL page mapping (an SSD data structure) so that the logical address of that page, as understood by the OS, now points to the new physical page. Because the file data is also no longer contiguous within the same erase block (the smallest group of pages that can be erased), we are also building up a form of fragmentation debt that will cost us in future garbage collection in the SSD. Horribly inefficient.
As an aside, looking at my PC filesystem (C:\WINDOWS\system32>fsutil fsinfo ntfsinfo c:): it has a 512-byte sector size and a 4 KB allocation (cluster) size. Neither of which maps to the SSD page size - probably not very efficient.
There are some issues with just writing with e.g. pwrite() to the kernel page cache and letting the OS handle writing things out. First, you need to issue an additional sync_file_range() call after calling pwrite() to actually kick off the IO; otherwise it will all wait until you call fsync() and unleash an IO storm. Second, fsync() seems to block future calls to write() on the same file. Lastly, you have no control over how the kernel writes things to the SSD, which it may do well, or it may do poorly, causing a lot of write amplification.
Because of the above reasons, and because I need AIO for reads of the log anyway, I'm opting for writing to the log with O_DIRECT and O_DSYNC and having full control.
As I understand it, O_DIRECT requires all writes to be aligned to sector size and in whole numbers of sectors. So every time I decide to issue an append to the log, I need to add some padding to the end to bring it up to a whole number of sectors (if all writes are always a whole number of sectors, they will also be correctly aligned, at least in my code.) Ok, that's not so bad. But my question is, wouldn't it be better to round up to a whole number of SSD pages instead of sectors? Presumably that would eliminate write amplification?
That could burn a huge amount of space, especially if writing small amounts of data to the log at a time (e.g. a couple hundred bytes). It also may be unnecessary. SSDs like the Samsung EVO have a write cache, and they don't flush it on fsync(); instead they rely on capacitors to write the cache out to flash in the event of a power loss. In that case, maybe the SSD does the right thing with an append-only log written sectors at a time - it may not write out the final partial page until the next append(s) arrive and complete it (or until it is forced out of the cache by large amounts of unrelated IO). Since the answer likely varies by device and maybe filesystem, is there a way I can code up the two possibilities and test my theory? Some way to measure write amplification or the number of updated/RMW pages on Linux?
I will try to answer your question, as I had the same task, but with SD cards, which are still flash memory.
Short Answer
You can only write a full page of 512 bytes at a time in flash memory. Since flash memory has a limited write-cycle count, the driver chip buffers/randomizes writes to improve the drive's lifetime.
To write a bit in flash memory, you must first erase the entire page (512 bytes) where it sits. So if you want to append or modify one byte somewhere, the entire page where it resides has to be erased first.
The process can be summarized as:
Read the whole page to a buffer
Modify the buffer with your added content
Erase the whole page
Rewrite the whole page with the modified buffer
Long Answer
The sector (page) size comes down to the very hardware of the flash implementation and the physical flash driver, over which you have no control. That page has to be cleared and rewritten each time you change something.
As you probably already know, you cannot rewrite a single bit in a page without clearing and rewriting the entire 512 bytes. Now, flash drives have a write-cycle life of about 100,000 before a sector can be damaged. To improve lifetime, the physical driver (and sometimes the system) usually has a write-randomization algorithm to avoid always writing to the same sector. (By the way, never defragment an SSD; it's useless and at best reduces its lifetime.)
Concerning the cluster, this is handled at a higher level, related to the file system, and this you do have control over. Usually, when you format a new hard drive, you can select the cluster size, which on Windows is the Allocation Unit Size in the format window.
Most file systems I know of work with an index located at the beginning of the disk. This index keeps track of each cluster and what is assigned to it. This means a file will occupy at least one cluster, even if it's much smaller.
Now the trade-off: the smaller your cluster size, the bigger your index table will be, occupying more space; but if you have a lot of small files, you will waste less space per file. On the other hand, if you only store big files, you want to select the biggest cluster size that is just slightly larger than your file size.
Since your task is logging, I would recommend logging into a single, huge file with a big cluster size. Having experimented with this type of log, having a large number of files within a single folder can cause issues, especially on embedded devices.
Implementation
Now, if you have raw access to the drive and want to really optimize, you can directly write to the disk without using the file system.
On the upside
* Will save you quite some disk space
* Will render the disk tolerant of failure, if your design is smart enough
* Will require far fewer resources if you are on a limited system
On the downside
* Much more work and debug
* The drive won't be natively recognized by the system.
If you only log, you don't need a file system; you just need an entry point to a page where you write your data, which will continuously increase.
The implementation I did on an SD card was to reserve 100 pages at the beginning of the flash to store information about the write and read locations. This would fit in a single page, but to avoid wear-cycle issues, I would write sequentially, in a circular fashion, over the 100 pages, and then have an algorithm to check which one contained the most recent information.
The write position was saved every 5 minutes or so, which means that in case of a power outage I would lose only up to 5 minutes of the log. It is also possible, starting from the last saved write location, to check whether subsequent sectors contain valid data before writing further.
This provided a very robust solution, as it is far less likely to suffer table corruption.
I would also suggest buffering 512 bytes and writing page by page.
Others
You may also want to check out some log-specific file systems; they might simply do the job for you: Log-structured file system.

mmap versus memory allocated with new

I have a BitVector class that can either allocate memory dynamically using new or it can mmap a file. There isn't a noticeable difference in performance when using it with small files, but when using a 16GB file I have found that the mmap file is far slower than the memory allocated with new. (Something like 10x slower or more.) Note that my machine has 64GB of RAM.
The code in question is loading values from a large disk file and placing them into a Bloom filter which uses my BitVector class for storage.
At first I thought this might be because the backing for the mmap file was on the same disk as the file I was loading from, but this didn't seem to be the issue. I put the two files on two physically different disks, and there was no change in performance. (Although I believe they are on the same controller.)
Then, I used mlock to try to force everything into RAM, but the mmap implementation was still really slow.
So, for the time being I'm just allocating the memory directly. The only thing I'm changing in the code for this comparison is a flag to the BitVector constructor.
Note that to measure performance I'm both looking at top and watching how many states I can add into the Bloom filter per second. The CPU usage doesn't even register on top when using mmap - although jbd2/sda1-8 starts to move up (I'm running on an Ubuntu server), which looks to be a process that is dealing with journaling for the drive. The input and output files are stored on two HDDs.
Can anyone explain this huge difference in performance?
Thanks!
Just to start with, mmap is a system call, or interface, provided to access the virtual memory of the system.
Now, on Linux (I hope you are working on *nix), a lot of performance improvement is achieved by lazy loading - demand paging, and the closely related copy-on-write.
For mmap as well, this kind of lazy loading is implemented.
What happens is, when you call mmap on a file, the kernel does not immediately allocate main-memory pages for the mapped file. Instead, it waits for the program to read from or write to the not-yet-backed page, at which point a page fault occurs, and the fault handler actually loads the particular part of the file that fits in that page frame (the page table is also updated, so that the next time you read/write the same page, it points to a valid frame).
Now, you can control this behavior with mlock, madvise, MAP_POPULATE flag with mmap etc.
The MAP_POPULATE flag to mmap tells the kernel to map the file into memory pages before the call returns, rather than page-faulting every time you access a new page. So, until the file is loaded, the call will block.
From the Man Page:
MAP_POPULATE (since Linux 2.5.46)
Populate (prefault) page tables for a mapping. For a file
mapping, this causes read-ahead on the file. Later accesses
to the mapping will not be blocked by page faults.
MAP_POPULATE is supported for private mappings only since
Linux 2.6.23.

FlushFileBuffers() is so slow

We are repeatedly writing (many thousands of times) to a single large archive file, patching various parts of it. After each write we were calling FlushFileBuffers(), but have found this is very, very slow. If we wait and only call it every now and then (say, every 32-ish writes), things run better, but I don't think this is the correct way of doing this.
Is there any way to not flush the buffer at all until we complete our last patch? If we take away the call completely, close() does handle the flush, but then it becomes a huge bottleneck in itself. Failing that, having it not lock our other threads while it runs would make it less annoying, as we won't be doing any read IO on that file outside of the writes. It just feels like the disk system is really getting in the way here.
More Info:
Target file is currently 16Gigs, but is always changing (usually upwards). We are randomly pinging all over the place in the file for the updates, and it's big enough that we can't cache the whole file. In terms of fragmentation, who knows. This is a large database of assets that gets updated frequently, so quite probably. Not sure of how to make it not fragment. Again, open to any suggestions.
If you know the maximum size of the file at the start, then this looks like a classic memory-mapped file application.
edit: (On Windows, at least) you can't change the size of a memory-mapped file while it's mapped. But you can very quickly expand it between opening the file and opening the mapping: simply SetFilePointer() to some large value and SetEndOfFile(). You can similarly shrink it after you close the mapping and before you close the file.
You can map a <4 GB view (or multiple views) into a much larger file, and the filesystem cache tends to be efficient with memory-mapped files because it's the same mechanism the OS uses for loading programs, so it is well tuned. You can let the OS manage when an update occurs, or you can force a flush of certain memory ranges.