Understanding buffering behavior of fwrite() - c++

I am using the function call fwrite() to write data to a pipe on Linux.
Earlier, fwrite() was being called for small chunks of data (average 20 bytes) repeatedly and buffering was left to fwrite(). strace on the process showed that 4096 bytes of data was being written at a time.
It turned out that this writing process was the bottleneck in my program. So I decided to buffer data in my code into blocks of 64KB and then write the entire block at a time using fwrite(). I used setvbuf() to set the FILE* pointer to 'No Buffering'.
The performance improvement was not as significant as I'd expected.
More importantly, the strace output showed that data was still being written 4096 bytes at a time. Can someone please explain this behavior to me? If I am calling fwrite() with 64KB of data, why is it writing only 4096 bytes at a time?
Is there an alternative to fwrite() for writing data to a pipe using a FILE* pointer?

The 4096 comes from the Linux machinery that underlies pipes. There are two places it occurs. One is the capacity of the pipe itself. The capacity is one system page on older versions of Linux, which is 4096 bytes on a 32-bit i386 machine. (On more modern versions of Linux the capacity is 64K.)
The other place you'll run into that 4096-byte limit is the defined constant PIPE_BUF, the number of bytes that are guaranteed to be written atomically. On Linux this is 4096 bytes. What this limit means depends on whether you have set the pipe to blocking or non-blocking mode. Do a man 7 pipe for all the gory details.
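If you want to see these limits on your own system, here is a minimal sketch (Linux/POSIX only) that prints the compile-time PIPE_BUF constant and the value fpathconf() reports for an actual pipe:

#include <cstdio>
#include <limits.h>    // PIPE_BUF
#include <unistd.h>    // pipe, fpathconf

int main() {
    int fds[2];
    if (pipe(fds) != 0) return 1;
    // Compile-time lower bound and the runtime value for this particular pipe.
    std::printf("PIPE_BUF (limits.h):      %d\n", PIPE_BUF);
    std::printf("_PC_PIPE_BUF (fpathconf): %ld\n", fpathconf(fds[0], _PC_PIPE_BUF));
    return 0;
}

Note that PIPE_BUF only bounds the size of atomic writes; it says nothing about the pipe's total capacity.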
If you are trying to exchange huge volumes of data at a high rate you might want to rethink your use of pipes. You're on a Linux box, so shared memory is an option. You can use pipes to send relatively small amounts of data as a signaling mechanism.

If you want to change the buffering behavior, you must do so immediately after the fopen (or before any I/O, for the standard filehandles stdin, stdout, stderr). You also do not want to disable buffering and try to manage the buffer yourself; rather, specify your 64K buffer to setvbuf so that it can be used properly.
If you really want to manage the buffering manually, do not use stdio; use the lower level open, write, and close calls.
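As an illustration of that last point, here is a minimal sketch (the file name is just a placeholder) that hands a 64K buffer to stdio immediately after fopen:

#include <cstdio>

int main() {
    static char buf[64 * 1024];                   // 64K buffer owned by us; must outlive the stream
    std::FILE* fp = std::fopen("out.pipe", "w");  // placeholder name for your pipe/file
    if (!fp) return 1;
    // Must be called before any I/O on fp; _IOFBF means fully buffered.
    std::setvbuf(fp, buf, _IOFBF, sizeof buf);
    // ... fwrite() calls here are now collected into the 64K buffer ...
    std::fclose(fp);                              // also flushes the buffer
    return 0;
}

Whether that improves throughput still depends on the pipe capacity of the kernel you are running, as described in the other answer.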

Related

AES memory efficiency

I'm writing a C++ program for AES encryption in CTR mode, but my question doesn't require knowledge of either.
I'm wondering how much of a file I should buffer to encrypt and output to the new encrypted file. I ask this because I know disk reads are quite expensive, so it only makes sense that I should, if possible, read and buffer the entire original file, encrypt it, and output to the new file. However, if the file is 1 GB, I don't want to reserve a whole 1 GB in main memory for the duration of the encryption.
So I'm curious what the optimal buffer size is. For example, buffering 100 MB and performing 10 iterations of encryption to process the entire 1 GB file. Thanks.
Memory map the file and let the system figure out the right buffer size.
Usually the file is buffered into main memory anyway (on server and desktop systems). So the buffer size in your application can be kept relatively small. 1 MiB would be plenty and would probably not matter much on any system with 1 GiB of main memory or more.
On embedded systems that do not buffer files in memory, it may require some figuring out of what is happening underneath and how much memory can be spared. I would consider a buffer of about 1-8 KiB a good minimum requirement. If you go lower than that, you might want to time the AES operations as well.
To make sure you can optimize later on, you may want to make the buffer a multiple of 64 bytes (the block size of AES is 16 bytes and that of SHA-512 is 64 bytes). In general, try to keep to full powers of two or as close to that as possible (one MiB is 2^20 bytes).
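As a concrete illustration (a sketch only: the 1 MiB size and the process_block routine are placeholders, not part of any particular AES library), a fixed-size read/encrypt/write loop could look like this:

#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for your AES-CTR routine; encrypts n bytes in place.
void process_block(unsigned char* data, std::size_t n);

int encrypt_file(const char* in_path, const char* out_path) {
    std::FILE* in = std::fopen(in_path, "rb");
    std::FILE* out = std::fopen(out_path, "wb");
    if (!in || !out) return 1;
    std::vector<unsigned char> buf(1 << 20);      // 1 MiB: a multiple of 16 and of 64
    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), in)) > 0) {
        process_block(buf.data(), n);             // encrypt this chunk
        std::fwrite(buf.data(), 1, n, out);
    }
    std::fclose(in);
    std::fclose(out);
    return 0;
}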
Who's telling you that "disk reads are quite expensive"? Unless you're processing terabytes of data the cost of IO is going to be so inconsequential you'll have a hard time measuring it. A 1MB buffer will be way more than what you need. I bet you'd have a hard time finding a benchmarkable difference between 64KB and 1MB or more.
The one exception to this is if you're reading a lot of data off of a really slow device, something like a NAS drive on a congested network, but even then I'd consider any effort to implement buffering to be a false optimization. In that case copy the data to a local drive, process it off of local storage.
C++ buffers input and output with reasonable defaults anyway, plus most operating systems will fetch blocks of data as you're reading sequentially in order to make retrieval efficient. Unless you have a very compelling reason, stick with the normal behaviour. There should be no need to write custom buffering code.

Concurrently writing to file while reading it out using mmap

The situation is this.
A large buffer of data (which shall exceed reasonable RAM consumption) is being generated by the program.
The program concurrently serves a websocket which will allow a web client to specify a small subset of this buffer of data to view.
To support the first goal, the file is written to using standard methods (I use portable C-stdio fopen and fwrite because it's been shown to be faster than various "pure C++" methods. Doesn't matter. Data gets appended to file; stdio will buffer the writes and periodically flush them.)
To support the second goal (on BSD, in particular iOS), the file is opened (open from sys/fcntl.h -- not as portable as stdio.h) and memory-mapped (mmap from sys/mman.h -- ditto). By deciding to use memory mapping I have to give up some portability with this code. It seems like Boost is something I could look at to avoid wheel reinvention.
Anyway, my question is about how exactly I'm supposed to do this, because there will be at least two threads: The main program thread appending to the file periodically, and the network (or a worker) thread which responds to web requests and delivers data read out of the memory regions that are mapped to the file on disk.
Supposing the file starts out 1024 bytes in size, mmap is called initially mapping 1024 bytes. As the main thread writes a further 512 bytes into the file, how can the network thread be notified or know anything about the current actual size of the file (so that it can munmap and mmap again with a larger buffer corresponding to the new size)? Furthermore, if I do this naively, I am wary of a situation where the main thread reports that 512 bytes are written, so the other thread now maps 1536 bytes of the file, but not all of the new 512 bytes actually got written to disk yet (OS is still working on writing it, maybe). What happens now? Could there be some garbage that shows up? Will my program crash?
How can I determine when data has been properly flushed? How can I be notified in a timely fashion after the data has been flushed so that I can memory map it?
In particular, is calling fflush the only way to guarantee that the file is now updated w.r.t. the stream, and then can I guarantee (once fflush returns) that the memory map can access the new size without an access violation? What about fsync?
When you are using POSIX API directly in the form of mmap, you should also be using it directly for the writing. POSIX and LibC interfaces just don't play well together.
write is a system call which transfers the data directly to the kernel. It would be slow for writing byte-by-byte, but for writing large buffers it is a tiny fraction faster because it has less overhead (fwrite ends up calling write under the hood anyway). And it is definitely more efficient than fwrite+fflush, because those may end up being two or more calls to write, whereas a direct write is just one.
The documentation of mmap is not very clear about it, but it seems you must not request more bytes than the file actually has.
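Under that constraint, a reader-side sketch (illustrative names, minimal error handling) would check the file's current size with fstat and map only what is actually there:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Map whatever the file currently contains; the caller munmap()s *len bytes when done.
void* map_current_contents(const char* path, size_t* len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    *len = (size_t)st.st_size;                     // only map bytes that actually exist
    if (*len == 0) { close(fd); return nullptr; }  // mmap of length 0 would fail
    void* p = mmap(nullptr, *len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                                     // the mapping stays valid after close
    return p == MAP_FAILED ? nullptr : p;
}

The writer would append with write(), as suggested above, and the reader can repeat this map-by-current-size step whenever it needs a larger view.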

Write a large file to disk from RAM

If I need to write a large file from allocated memory to disk, what is the most efficient way to do it?
Currently I use something along the lines of:
char* data = static_cast<char*>(operator new(0xF00000000)); // 60 GB
// Do something to fill `data` with data
std::ofstream("output.raw", std::ios::binary).
write(data, 0xF00000000);
But I am not sure if the most straightforward way is also the most efficient, taking into account various buffering mechanisms and alike.
I am using Windows 7 64-bit and Visual Studio 2012 RC compiler with 64-bit target.
For Windows, you should use CreateFile API. Have a good read of that page and any links from it mentioning optimization. There are some flags you pass in to turn off buffering. I did this in the past when I was collecting video at about 800MB per second, and having to write off small parts of it as fast as possible to a RAID array.
Now, for the flags - I think it's primarily these:
FILE_FLAG_NO_BUFFERING
FILE_FLAG_WRITE_THROUGH
For reading, you may want to use FILE_FLAG_SEQUENTIAL_SCAN, although I think this has no effect if buffering is turned off.
Have a look at the Caching Behaviour section
There are a couple of things you need to do. Firstly, you should always write amounts of data that are a multiple of the sector size. This is (or at least was) 512 bytes almost universally, but newer Advanced Format drives use 4096-byte sectors, so aligning to 4096 is the safer choice.
Secondly, your memory has to be aligned to that sector size too. You can either use _aligned_malloc() or just allocate more buffer than you need and align manually.
There may be other memory optimization concerns, and you may want to limit individual write operations to a memory page size. I never went into that depth. I was still able to write data at speeds very close to the disk's limit. It was significantly faster than using stdio calls.
If you need to do this in the background, you can use overlapped I/O, but to be honest I never understood it. I made a background worker thread dedicated to writing out video buffer and controlled it externally.
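For reference, a rough sketch of the unbuffered write path (the 4096-byte alignment is an assumption; query the real sector size, e.g. with GetDiskFreeSpace, in production code):

#include <windows.h>

bool write_unbuffered(const char* path, const void* data, DWORD bytes) {
    // Assumes bytes is a multiple of the sector size and data is sector-aligned
    // (e.g. allocated with _aligned_malloc(bytes, 4096)).
    HANDLE h = CreateFileA(path, GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;
    DWORD written = 0;
    BOOL ok = WriteFile(h, data, bytes, &written, nullptr);
    CloseHandle(h);
    return ok && written == bytes;
}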
The most promising thing that comes to mind is memory mapping the output file. Depending on how the data gets filled, you may even be able to have your existing program write directly to the disk via the pointer, and not need a separate write step at the end. That trusts the OS to efficiently page the file, which it may be having to do with the heap memory anyway... could potentially avoid a disk-to-disk copy.
I'm not sure how to do it in Windows specifically, but you can probably notify the OS of your intended memory access pattern to increase performance further.
(boost::asio has portable support for memory mapped files)
If you want to use std::ofstream you should make sure of the following:
No buffer is used by the file stream. The way to do this is to call out.rdbuf()->pubsetbuf(0, 0).
Make sure that the std::locale used by the stream doesn't do any character conversion, i.e., std::use_facet<std::codecvt<char, char, std::mbstate_t> >(loc).always_noconv() yields true. The "C" locale does this.
With this, I would expect that std::ofstream is as fast as any other approach writing a large buffer. I would also expect it to be slower than using memory mapped I/O because memory mapped I/O should avoid paging sections of the memory when reading them just to write their content.
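Putting those two points together, a minimal sketch (the file name is just a placeholder; configure the stream before opening the file, since calling pubsetbuf after open is implementation-defined):

#include <fstream>
#include <locale>

int main() {
    std::ofstream out;
    out.rdbuf()->pubsetbuf(nullptr, 0);          // no stream-level buffer
    out.imbue(std::locale::classic());           // "C" locale: no character conversion
    out.open("output.raw", std::ios::binary);    // open only after configuring the stream
    // out.write(data, size);
    return 0;
}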
Open a file with CreateFile, use SetEndOfFile to preallocate the space for the file (to avoid too much fragmentation as you write), then call WriteFile with 2 MB sized buffers (this size works the best in most scenarios) in a loop until you write the entire file out.
FILE_FLAG_NO_BUFFERING may help in some situations and may make the situation worse in others, so no real need to use it, because normally Windows file system write cache is doing its work well.

Reading from a socket 1 byte a time vs reading in large chunk

What's the difference - performance-wise - between reading from a socket 1 byte a time vs reading in large chunk?
I have a C++ application that needs to pull pages from a web server and parse the received page line by line. Currently, I'm reading 1 byte at a time until I encounter a CRLF or the max of 1024 bytes is reached.
If reading in large chunks (e.g. 1024 bytes at a time) is a lot better performance-wise, any idea on how to achieve the same behavior I currently have (i.e. being able to store and process one html line at a time, up to the CRLF, without consuming the succeeding bytes yet)?
EDIT:
I can't afford very big buffers. I'm working within a very tight budget, as the application runs on an embedded device. I prefer keeping only one fixed-size buffer, preferably holding one html line at a time. This makes my parsing and other processing easy, because any time I access the buffer for parsing I can assume I'm looking at one complete html line.
Thanks.
I can't comment on C++, but from other platforms - yes, this can make a big difference; particularly in the number of switches the code needs to make, and the number of times it needs to worry about the async nature of streams etc.
But the real test is, of course, to profile it. Why not write a basic app that churns through an arbitrary file using both approaches, and test it for some typical files... the effect is usually startling, if the code is IO bound. If the files are small and most of your app runtime is spent processing the data once it is in memory, you aren't likely to notice any difference.
If you are reading directly from the socket, and not from an intermediate higher-level representation that can be buffered, then without any doubt it is better to read a full 1024 bytes at once, put them in a buffer in RAM, and then parse the data from RAM.
Why? Reading on a socket is a system call, and it causes a context switch on each read, which is expensive. Read more about it: IBM Tech Lib: Boost socket performances
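To keep the one-line-at-a-time interface while still reading in chunks, here is a hedged sketch (recv_line and the 1024-byte sizes are illustrative, not any particular library's API): read as much as fits into a carry-over buffer and hand back one CRLF-terminated line per call.

#include <sys/socket.h>
#include <cstring>

// Fills line with one CRLF-terminated line (CRLF stripped); leftover bytes stay
// in the carry-over buffer between calls. Returns the line length, or -1 on error.
static char carry[1024];
static size_t carry_len = 0;

ssize_t recv_line(int sock, char* line, size_t cap) {
    for (;;) {
        // Look for CRLF in what we already have buffered.
        for (size_t i = 0; i + 1 < carry_len; ++i) {
            if (carry[i] == '\r' && carry[i + 1] == '\n') {
                size_t n = i < cap ? i : cap;                // truncate to caller's buffer
                std::memcpy(line, carry, n);
                std::memmove(carry, carry + i + 2, carry_len - i - 2);
                carry_len -= i + 2;
                return (ssize_t)n;
            }
        }
        if (carry_len == sizeof carry) return -1;            // line longer than the buffer
        ssize_t r = recv(sock, carry + carry_len, sizeof carry - carry_len, 0);
        if (r <= 0) return -1;                               // connection closed or error
        carry_len += (size_t)r;
    }
}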
First and simplest:
cin.getline(buffer,1024);
Second, usually all IO is buffered so you don't need to worry too much
Third, CGI process startup usually costs much more than input processing (unless it is a huge file)... So you may just not need to think about it.
G'day,
One of the big performance hits by doing it one byte at a time is that your context is going from user time into system time over and over. And over. Not efficient at all.
Grabbing one big chunk, typically up to an MTU size, is measurably more efficient.
Why not scan the content into a vector and iterate over that looking out for \n's to separate your input into lines of web input?
HTH
cheers,
You are not reading one byte at a time from the socket; you are reading one byte at a time from the C/C++ I/O system, which, if you are using CGI, will have already buffered up all the input from the socket. The whole point of buffered I/O is to make the data available to the programmer in a way that is convenient for them to process, so if you want to process one byte at a time, go ahead.
Edit: On reflection, it is not clear from your question if you are implementing CGI or just using it. You could clarify this by posting a code snippet which indicates how you currently read that single byte.
If you are reading the socket directly, then you should simply read the entire response to the GET into a buffer and then process it. This has numerous advantages, including performance and ease of coding.
If you are limited to a small buffer, then use classic buffering algorithms like:
#include <unistd.h>

static unsigned char buffer[1024];
static size_t pos = 0, len = 0;

int getbyte(int fd) {
    if (pos == len) {                                 // buffer is empty
        ssize_t n = read(fd, buffer, sizeof buffer);  // fill buffer
        if (n <= 0) return -1;                        // EOF or error
        len = (size_t)n;
        pos = 0;                                      // reset buffer pointer to start
    }
    return buffer[pos++];                             // get byte at pointer, increment it
}
You can open the socket file descriptor with the fdopen() function. Then you have buffered IO so you can call fgets() or similar on that descriptor.
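For example (a minimal sketch; sockfd is assumed to be an already connected socket descriptor):

#include <stdio.h>

void read_lines(int sockfd) {
    FILE* in = fdopen(sockfd, "r");      // wrap the socket in a buffered FILE*
    if (!in) return;
    char line[1024];
    while (fgets(line, sizeof line, in)) {
        // process one line (including the trailing "\r\n", if present)
    }
    fclose(in);                          // also closes sockfd
}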
There is no difference at the operating system level, data are buffered anyway. Your application, however, must execute more code to "read" bytes one at a time.

How best to manage Linux's buffering behavior when writing a high-bandwidth data stream?

My problem is this: I have a C/C++ app that runs under Linux, and this app receives a constant-rate high-bandwidth (~27MB/sec) stream of data that it needs to stream to a file (or files). The computer it runs on is a quad-core 2GHz Xeon running Linux. The filesystem is ext4, and the disk is a solid state E-SATA drive which should be plenty fast for this purpose.
The problem is Linux's too-clever buffering behavior. Specifically, instead of writing the data to disk immediately, or soon after I call write(), Linux will store the "written" data in RAM, and then at some later time (I suspect when the 2GB of RAM starts to get full) it will suddenly try to write out several hundred megabytes of cached data to the disk, all at once. The problem is that this cache-flush is large, and holds off the data-acquisition code for a significant period of time, causing some of the current incoming data to be lost.
My question is: is there any reasonable way to "tune" Linux's caching behavior, so that either it doesn't cache the outgoing data at all, or if it must cache, it caches only a smaller amount at a time, thus smoothing out the bandwidth usage of the drive and improving the performance of the code?
I'm aware of O_DIRECT, and will use it if I have to, but it does place some behavioral restrictions on the program (e.g. buffers must be aligned to and be a multiple of the disk sector size) that I'd rather avoid if I can.
You can use the posix_fadvise() with the POSIX_FADV_DONTNEED advice (possibly combined with calls to fdatasync()) to make the system flush the data and evict it from the cache.
See this article for a practical example.
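A minimal sketch of that pattern (the chunking policy is up to you): after writing each large chunk, flush it and then tell the kernel those pages will not be needed again.

#include <fcntl.h>
#include <unistd.h>

// Call after each large chunk has been written to fd.
void drop_written_chunk(int fd, off_t offset, off_t len) {
    fdatasync(fd);                                         // make sure the pages are clean
    posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);   // then evict them from the cache
}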
If you have latency requirements that the OS cache can't meet on its own (the default IO scheduler is usually optimized for bandwidth, not latency), you are probably going to have to manage your own memory buffering. Are you writing out the incoming data immediately? If you are, I'd suggest dropping that architecture and going with something like a ring buffer, where one thread (or multiplexed I/O handler) is writing from one side of the buffer while the reads are being copied into the other side.
At some size, this will be large enough to handle the latency required by a pessimal OS cache flush. Or not, in which case you're actually bandwidth limited and no amount of software tuning will help you until you get faster storage.
You can adjust the page cache settings in /proc/sys/vm, (see /proc/sys/vm/dirty_ratio, /proc/sys/vm/swappiness specifically) to tune the page cache to your liking.
If we are talking about std::fstream (or any C++ stream object)
You can specify your own buffer using:
std::streambuf* std::ios::rdbuf(std::streambuf* sb);
By defining your own buffer you can customize the behavior of the stream.
Alternatively you can always flush the buffer manually at pre-set intervals.
Note: there is a reason for having a buffer. It is quicker than writing to the disk directly (every 10 bytes). There is very little reason to write to a disk in chunks smaller than the disk block size. If you write too frequently, the disk controller will become your bottleneck.
But I do take issue with the writing being done on the same thread, so that it blocks the reading.
While the data is being written there is no reason why another thread cannot continue to read data from your stream (you may need some fancy footwork to make sure they are reading/writing to different areas of the buffer). But I don't see any real potential issue with this, as the I/O system will go off and do its work asynchronously (potentially stalling your writing thread, depending on your use of the I/O system, but not necessarily your application).
I know this question is old, but we know a few things now we didn't know when this question was first asked.
Part of the problem is that the default values for /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio are not appropriate for newer machines with lots of memory. Linux begins the flush when dirty_background_ratio is reached, and blocks all I/O when dirty_ratio is reached. Lower dirty_background_ratio to start flushing sooner, and raise dirty_ratio to start blocking I/O later. On very large memory systems, (32GB or more) you may even want to use dirty_bytes and dirty_background_bytes, since the minimum increment of 1% for the _ratio settings is too coarse. Read https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ for a more detailed explanation.
Also, if you know you won't need to read the data again, call posix_fadvise with FADV_DONTNEED to ensure cache pages can be reused sooner. This has to be done after linux has flushed the page to disk, otherwise the flush will move the page back to the active list (effectively negating the effect of fadvise).
To ensure you can still read incoming data in the cases where Linux does block on the call to write(), do file writing in a different thread than the one where you are reading.
Well, try this ten pound hammer solution that might prove useful to see if i/o system caching contributes to the problem: every 100 MB or so, call sync().
You could use a multithreaded approach: have one thread simply read data packets and add them to a FIFO, and another thread remove packets from the FIFO and write them to disk. This way, even if the write to disk stalls, the program can continue to read incoming data and buffer it in RAM.
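A skeletal version of that pipeline (a sketch only; the Packet type and the disk-writing routine are placeholders):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

using Packet = std::vector<char>;              // placeholder for your packet type

class PacketFifo {
public:
    void push(Packet p) {                      // called by the data-acquisition thread
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(p)); }
        cv_.notify_one();
    }
    bool pop(Packet& out) {                    // called by the disk-writing thread
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty() || done_; });
        if (q_.empty()) return false;          // finished and fully drained
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
    void finish() {                            // signal end of input
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
    }
private:
    std::queue<Packet> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

The acquisition thread calls push() for every incoming packet, while a second thread loops on pop() and writes each packet to disk, so a slow cache flush only grows the queue instead of stalling acquisition.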