Need help in debugging this apparent paradox in C++ [duplicate]

Need help in debugging this apparent paradox in C++ [duplicate] - c++

I am unable to find the underlying concept of IO Stream Buffering and what does it mean.
Any tutorials and links will be helpful.

Buffering is a fundamental part of software that handles input and output. The buffer holds data that is in between the software interface and the hardware interface, since hardware and software run at different speeds.
A component which produces data can put it into a buffer, and later the buffer is "flushed" by sending the collected data to the next component. Likewise the other component may be "waiting on the buffer" until a complete piece of data, or enough data to be efficiently processed, is available for input.
In C++, std::basic_filebuf implements a buffer over a filesystem file. It stores up to a fixed number of bytes so the operating system always works with a minimum transaction size, while the program can access individual characters if desired.
See Wikipedia.

Buffering is using memory (users memory) instead of sending the data straight to the OS (i.e. disk). Saves on a context switch.

Here's the concept. Imagine you have an application that needs to write it's data onto the hard drive. Let's say it wants to write something (e.g. update a log file) every half of a second. Is this good? No, and here is the reason.
Software can be very fast, but the speed on which the HDD can operate is limited, and it's much slower than the memory, and your application. To write something, the HDD needs to reposition it's magnetic heads to a specific sector (which probably involves slowing the disc rotation speed), write the data, and reposition back to where it was. So your application could operate very slowly (well, that's a theoretical example of course).
Buffering helps to deal with this. Instead of writing to the disc each time, the data is being accumulated in the buffer somewhere in the memory. Once a sufficient amount of data is gathered, the buffer is flushed: the data from it gets written on the disk. Such approach helps to minimize HDD operations and improve overall speed.

Related

Optimal way of writing to append-only files on an SSD

I want to know what's the optimal way to log to an SSD. Think of something like a database log, where you're writing append-only, but you also have to fsync() every transaction or few transactions to ensure application level data durability.
I'm going to give some background on how SSDs work, so if you already know all this, please skim it anyway in case I am wrong about something. Some good stuff for further reading is Emmanuel Goossaert 6-part guide to coding for SSDs and the paper Don't Stack your Log on my Log [pdf].
SSDs write and read in whole pages only. Where the page size differs from SSD to SSD but is typically a multiple of 4kb. My Samsung EVO 840 uses an 8kb page size (which incidentally, Linus calls "unusable shit" in his usual colorful manner.) SSDs cannot modify data in-place, they can only write to free pages. So combining those two restrictions, updating a single byte on my EVO requires reading the 8kb page, changing the byte, and writing it to a new 8kb page and updating the FTL page mapping (a ssd data structure) so the logical address of that page as understood by the OS now points to the new physical page. Because the file data is also no longer contiguous in the same erase block (the smallest group of pages that can be erased) we are also building up a form of fragmentation debt that will cost us in future garbage collection in the SSD. Horribly inefficient.
As an asside, looking at my PC filesystem: C:\WINDOWS\system32>fsutil
fsinfo ntfsinfo c: It has a 512 byte sector size and a 4kb allocation
(cluster) size. Neither of which map to the SSD page size - probably
not very efficient.
There's some issues with just writing with e.g. pwrite() to the kernel page cache and letting the OS handle writing things out. First off, you'll need to issue an additional sync_file_range() call after calling pwrite() to actually kick off the IO, otherwise it will all wait until you call fsync() and unleash an IO storm. Secondly fsync() seems to block future calls to write() on the same file. Lastly you have no control over how the kernel writes things to the SSD, which it may do well, or it may do poorly causing a lot of write amplification.
Because of the above reasons, and because I need AIO for reads of the log anyway, I'm opting for writing to the log with O_DIRECT and O_DSYNC and having full control.
As I understand it, O_DIRECT requires all writes to be aligned to sector size and in whole numbers of sectors. So every time I decide to issue an append to the log, I need to add some padding to the end to bring it up to a whole number of sectors (if all writes are always a whole number of sectors, they will also be correctly aligned, at least in my code.) Ok, that's not so bad. But my question is, wouldn't it be better to round up to a whole number of SSD pages instead of sectors? Presumably that would eliminate write amplification?
That could burn a huge amount of space, especially if writing small amounts of data to the log at a time (e.g a couple hundred bytes.) It also may be unnecessary. SSDs like the Samsung EVO have a write cache, and they don't flush it on fsync(). Instead they rely on capacitors to write the cache out to the SSD in the event of a power loss. In that case, maybe the SSD does the right thing with an append only log being written sectors at a time - it may not write out the final partial page until the next append(s) arrives and completes it (or unless it is forced out of the cache due to large amounts of unrelated IOs.) Since the answer to that likely varies by device and maybe filesystem, is there a way I can code up the two possibilities and test my theory? Some way to measure write amplification or the number of updated/RMW pages on Linux?

I will try to answer your question, as I had the same task but in SD cards, which is still a flash memory.
Short Answer
You can only write a full page of 512 bytes in flash memory. Given the flash memory has a poor write count, the driver chip is buffering/randomizing to improve your drive lifetime.
To write a bit in flash memory, you must erase the entire page (512 bytes) where it sits first. So if you want to append or modify 1 byte somewhere, first it has to erase the entire page where it resides.
The process can be summarized as:
Read the whole page to a buffer
Modify the buffer with your added content
Erase the whole page
Rewrite the whole page with the modified buffer
Long Answer
The Sector (pages) is basically down to the very hardware of the flash implementation and flash physical driver, in which you have no control. That page has to be cleared and rewritten each time you change something.
As you probably already know, you cannot rewrite a single bit in a page without clearing and rewriting the entire 512 bytes. Now, Flash drives have a write cycle life of about 100'000 before a sector can be damaged. To improve lifetime, usually the physical driver, and sometimes the system will have a writing randomization algorithm to avoid always writing the same sector. (By the way, never do defragmentation on an SSD; it's useless and at best reduces the lifetime).
Concerning the cluster, this is handled at a higher level which is related to the file system and this you have control. Usually, when you format a new hard drive, you can select the cluster size, which on windows refers to the Allocation Unit Size of the format window.
Most file systems as I know work with an index which is located at the beginning of the disk. This index will keep track of each cluster and what is assigned to it. This means a file will occupy at least 1 sector, even if it's much smaller.
Now the trade-off is smaller is your sector size, bigger will be your index table and will occupy a lot of space. But if you have a lot of small files, then you will have a better occupation space.
On the other hand, if you only store big files and you want to select the biggest sector size, just slightly higher than your file size.
Since your task is to perform logging, I would recommend to log in single, huge file with big sector size. Having experimented with this type of log, having large amount of file within a single folder can cause issue, especially if you are in embedded devices.
Implementation
Now, if you have raw access to the drive and want to really optimize, you can directly write to the disk without using the file system.
On the upside
* Will save you quite some disk space
* Will render the disk tolerant in case of failure if your design is smart enough
* will require much fewer resources if you are on a limited system
On the downside
* Much more work and debug
* The drive won't be natively recognized by the system.
If you only log, you don't need to have a file system, you just need an entry point to a page where to write your data, which will continuously increase.
The implementation I've done on an SD card was to save 100 pages at the begging of the flash to store information about write and read location. This was held in a single page, but to avoid memory cycle issue, I would sequentially write in a circular method over the 100 pages and then have an algorithm to check which was the last to contain most recent information.
The position storage was written was done every 5 minutes or so which means in case of the power outage I would lose only 5 minutes of the log. It is also possible from the last write location to check further sector if they contain valid data before writing further.
This provided a very robust solution as they are very less likely to have table corruption.
I would also suggest to buffer 512 bytes and write page by page.
Others
You may also want to check some log specific file system, they might simply do the job for you: Log-structured file system

What is IO Stream Buffering?

I am unable to find the underlying concept of IO Stream Buffering and what does it mean.
Any tutorials and links will be helpful.

Buffering is a fundamental part of software that handles input and output. The buffer holds data that is in between the software interface and the hardware interface, since hardware and software run at different speeds.
A component which produces data can put it into a buffer, and later the buffer is "flushed" by sending the collected data to the next component. Likewise the other component may be "waiting on the buffer" until a complete piece of data, or enough data to be efficiently processed, is available for input.
In C++, std::basic_filebuf implements a buffer over a filesystem file. It stores up to a fixed number of bytes so the operating system always works with a minimum transaction size, while the program can access individual characters if desired.
See Wikipedia.

Buffering is using memory (users memory) instead of sending the data straight to the OS (i.e. disk). Saves on a context switch.

Here's the concept. Imagine you have an application that needs to write it's data onto the hard drive. Let's say it wants to write something (e.g. update a log file) every half of a second. Is this good? No, and here is the reason.
Software can be very fast, but the speed on which the HDD can operate is limited, and it's much slower than the memory, and your application. To write something, the HDD needs to reposition it's magnetic heads to a specific sector (which probably involves slowing the disc rotation speed), write the data, and reposition back to where it was. So your application could operate very slowly (well, that's a theoretical example of course).
Buffering helps to deal with this. Instead of writing to the disc each time, the data is being accumulated in the buffer somewhere in the memory. Once a sufficient amount of data is gathered, the buffer is flushed: the data from it gets written on the disk. Such approach helps to minimize HDD operations and improve overall speed.

Speeding up file I/O: mmap() vs. read()

I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems both make heavy use of posix_advise: first is a mmaped read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well but only for the moderate cases, the read() method manages our overall file cache much better and can deal well with 100s of GB of files, but is badly rate limited, mmap is able to pre-cache data making the sustained data rate of over 200MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks
is a similar problem to what I am working and provided a good starting point on this problem, along with the discussions in mmap-vs-read.

Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
if (pid = fork()) {
waitpid(pid, ...);
} else {
dup2(dest, 1);
dup2(source, 0);
execlp("cat", "cat");
}
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read() type file i/o be further optimized beyond the posix_advise calls on Linux, or having tuned the disk scheduler, VMM and posix_advise calls is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading etc. but it's dangerous and probably unproductive guess work. I'd just go with splicing the files into a pipe.
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you running out of memory while keeping your implementation simple if you don't actually have enough physical memory + swap for the entire mapping.**
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered, and lazy. I suspect this flag is redundant, the VFS likely does this better by default.

Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunk much bigger than a few kilobytes. Can't than be half a megabyte instead?

The problem here doesn't seem to be which api is used. It doesn't matter if you use mmap() or read(), the disc still has to seek to the specified point and read the data (although the os does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes) because you don't have call the os for every chunk, which becomes very slow.
I would also advise like Basile did to read more than 2kb consecutively so the disc doesn't have to seek that often.

performance of fread/fwrite while doing operation with binary file

I am writing some binary data into a binary file through fwrite and once i am through with writing i am reading back the same data thorugh fread.While doing this i found that fwrite is taking less time to write whole data where as fread is taking more time to read all data.
So, i just want to know is it fwrite always takes less time than fread or there is some issue with my reading portion.

Although, as others have said, there are no guarantees, you'll typically find that a single write will be faster than a single read. The write will be likely to copy the data into a buffer and return straight away, while the read will be likely to wait for the data to be fetched from the storage device. Sometimes the write will be slow if the buffers fill up; sometimes the read will be fast if the data has already been fetched. And sometimes one of the many layers of abstraction between fread/fwrite and the storage hardware will decide to go off into its own little world for no apparent reason.

The C++ language makes no guarantees on the comparative performance of these (or any other) functions. It is all down to the combination of hardware and operating system, the load on the machine and the phase of the moon.

These functions interact with the operating system's file system cache. In many cases it is a simple memory-to-memory copy. Write could indeed be marginally faster if you run your program repeatedly. It just needs to find a hole in the cache to dump its data. Flushing that data to the disk happens at a time you can't see or measure.
More work is usually needed to read. At a minimum it needs to traverse the cache structure to discover if the disk data is already cached. If not, it is going to have to block on a disk driver request to retrieve the data from the disk, that takes many milliseconds.
The standard trap with profiling this behavior is taking measurements from repeated runs of your program. They are not at all representative for the way your program is going to behave in the wild. The odds that the disk data is already cached are very good on the second run of your program. They are very poor in real life, reads are likely to be very slow, especially the first one. An extra special trap exists for a write, at some point (depending on the behavior of other programs too), the cache is not going to be able to buffer the write request. Write performance is then going to fall of a cliff as your program gets blocked until enough data is flushed to the disk.
Long story short: don't ever assume disk read/write performance measurements are representative for how your program will behave in production. And perhaps more to the point: there isn't anything you can do to solve disk I/O perf problems in your code.

You are seeing some effect of the buffer/cache systems as other have said, however, if you use async API (as you said your suing fread/write you should look at aio_read/aio_write) you can experiment with some other methods for I/O which are likely more well optimized for what your doing.
One suggestion is that if you are read/update/write/reading a file a lot, you should, by way of an ioctl or DeviceIOControl, request to the OS to provide you the geometry of the disk your code is running on, then determine the size of a disk cylander so you may be able to determine if you can do your read/write operations buffered inside of a single cylinder. This way, the drive head will not move for your read/write and save you a fair amount of run time.

How best to manage Linux's buffering behavior when writing a high-bandwidth data stream?

My problem is this: I have a C/C++ app that runs under Linux, and this app receives a constant-rate high-bandwith (~27MB/sec) stream of data that it needs to stream to a file (or files). The computer it runs on is a quad-core 2GHz Xeon running Linux. The filesystem is ext4, and the disk is a solid state E-SATA drive which should be plenty fast for this purpose.
The problem is Linux's too-clever buffering behavior. Specifically, instead of writing the data to disk immediately, or soon after I call write(), Linux will store the "written" data in RAM, and then at some later time (I suspect when the 2GB of RAM starts to get full) it will suddenly try to write out several hundred megabytes of cached data to the disk, all at once. The problem is that this cache-flush is large, and holds off the data-acquisition code for a significant period of time, causing some of the current incoming data to be lost.
My question is: is there any reasonable way to "tune" Linux's caching behavior, so that either it doesn't cache the outgoing data at all, or if it must cache, it caches only a smaller amount at a time, thus smoothing out the bandwidth usage of the drive and improving the performance of the code?
I'm aware of O_DIRECT, and will use that I have to, but it does place some behavioral restrictions on the program (e.g. buffers must be aligned and a multiple of the disk sector size, etc) that I'd rather avoid if I can.

You can use the posix_fadvise() with the POSIX_FADV_DONTNEED advice (possibly combined with calls to fdatasync()) to make the system flush the data and evict it from the cache.
See this article for a practical example.

If you have latency requirements that the OS cache can't meet on its own (the default IO scheduler is usually optimized for bandwidth, not latency), you are probably going to have to manage your own memory buffering. Are you writing out the incoming data immediately? If you are, I'd suggest dropping that architecture and going with something like a ring buffer, where one thread (or multiplexed I/O handler) is writing from one side of the buffer while the reads are being copied into the other side.
At some size, this will be large enough to handle the latency required by a pessimal OS cache flush. Or not, in which case you're actually bandwidth limited and no amount of software tuning will help you until you get faster storage.

You can adjust the page cache settings in /proc/sys/vm, (see /proc/sys/vm/dirty_ratio, /proc/sys/vm/swappiness specifically) to tune the page cache to your liking.

If we are talking about std::fstream (or any C++ stream object)
You can specify your own buffer using:
streambuf* ios::rdbuf ( streambuf* streambuffer);
By defining your own buffer you can customize the behavior of the stream.
Alternatively you can always flush the buffer manually at pre-set intervals.
Note: there is a reson for having a buffer. It is quicker than writting to a disk directly (every 10 bytes). There is very little reason to write to a disk in chunks smaller than the disk block size. If you write too frquently the disk controler will become your bottle neck.
But I have an issue with you using the same thread in the write proccess needing to block the read processes.
While the data is being written there is no reason why another thread can not continue to read data from your stream (you may need to some fancy footwork to make sure they are reading/writting to different areas of the buffer). But I don't see any real potential issue with this as the IO system will go off and do its work asyncroniously (potentially stalling your write thread (depending on your use of the IO system) but not nesacerily your application).

I know this question is old, but we know a few things now we didn't know when this question was first asked.
Part of the problem is that the default values for /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio are not appropriate for newer machines with lots of memory. Linux begins the flush when dirty_background_ratio is reached, and blocks all I/O when dirty_ratio is reached. Lower dirty_background_ratio to start flushing sooner, and raise dirty_ratio to start blocking I/O later. On very large memory systems, (32GB or more) you may even want to use dirty_bytes and dirty_background_bytes, since the minimum increment of 1% for the _ratio settings is too coarse. Read https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ for a more detailed explanation.
Also, if you know you won't need to read the data again, call posix_fadvise with FADV_DONTNEED to ensure cache pages can be reused sooner. This has to be done after linux has flushed the page to disk, otherwise the flush will move the page back to the active list (effectively negating the effect of fadvise).
To ensure you can still read incoming data in the cases where Linux does block on the call to write(), do file writing in a different thread than the one where you are reading.

Well, try this ten pound hammer solution that might prove useful to see if i/o system caching contributes to the problem: every 100 MB or so, call sync().

You could use a multithreaded approach—have one thread simply read data packets and added them to a fifo, and the other thread remove packets from the fifo and write them to disk. This way, even if the write to disk stalls, the program can continue to read incoming data and buffer it in RAM.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js