Syscall overhead - C++

How big, approximately, is the overhead of an I/O syscall on Linux from a C program? That is, how bad is issuing many small read/write operations compared with read/write on large buffers (on regular files or network sockets)? The app is heavily multithreaded.

Syscalls take at least 1-2 microseconds on most modern machines just for the syscall overhead, and much more time if they're doing anything complex that could block or sleep. Expect at least 20 microseconds and up to the order of milliseconds for IO. Compare this with a tiny function call or macro that reads a byte from a userspace buffer, which is likely to complete in a matter of nanoseconds (maybe 200 ns on a bad day).

You can measure this yourself. Just open /dev/zero and do some reading and writing while measuring the time. Also vary the number of bytes you put into each call - e.g. 1 byte, 2 bytes, 128 bytes, ... 4096 bytes. Also take care to use the read(2) and write(2) syscalls and not anything using internal buffers.
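A rough sketch of such a measurement: time a large number of read(2) calls of each size against /dev/zero and divide by the iteration count. The exact numbers will of course vary with your machine and kernel.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0) return 1;
    char buf[4096];
    for (size_t chunk : {1, 2, 128, 4096}) {
        const int iterations = 100000;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i)
            if (read(fd, buf, chunk) < 0)          // one syscall per iteration
                return 1;
        auto elapsed = std::chrono::steady_clock::now() - start;
        double ns = std::chrono::duration<double, std::nano>(elapsed).count();
        std::printf("%4zu-byte reads: %8.1f ns per call\n", chunk, ns / iterations);
    }
    close(fd);
    return 0;
}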

Related

Parallel compression algorithms

Many/most compression algorithms have a parallel-decompression implementation (like pigz for gzip, etc).
However, rarely does one see a reduction in time proportional to the number of processors thrown at the task, with most not benefiting at all from more than 6 processors.
I'm curious to know if there are any compression formats with parallel decompression built into the design - i.e. one that would theoretically be 100x faster with 100 CPUs than with 1.
Thank you and all the best :)
You're probably I/O bound. At some point more processors won't help if they're waiting for input or output. You just get more processors waiting.
Or maybe your input files aren't big enough.
pigz will in fact be 100x faster with 100 cpus, for a sufficiently large input, if it is not I/O bound. By default, pigz sends 128K blocks to each processor to work on, so you would need the input to be at least 13 MB in order to provide work for all 100 processors. Ideally a good bit more than that to get all the processors running at full steam at the same time.

Reading inputs in smaller, more frequent reads or one larger read

I was working on a C++ tutorial exercise that asked to count the number of words in a file. It got me thinking about the most efficient way to read the inputs. How much more efficient is it really to read the entire file at once than it is to read small chunks (line by line or character by character)?
The answer changes depending on how you're doing the I/O.
If you're using the POSIX open/read/close family, reading one byte at a time will be excruciating since each byte will cost one system call.
If you're using the C fopen/fread/fclose family or the C++ iostream library, reading one byte at a time still isn't great, but it's much better. These libraries keep an internal buffer and only call read when it runs dry. However, since you're doing something very trivial for each byte, the per-call overhead will still likely dwarf the per-byte processing you actually have to do. But measure it and see for yourself.
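For example, a character-at-a-time word counter using the buffered iostream API might look like the sketch below; each get() is a cheap library call that usually just pulls the next byte out of the stream's internal buffer rather than making a syscall.

#include <fstream>
#include <iostream>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1], std::ios::binary);
    long words = 0;
    bool in_word = false;
    char c;
    while (in.get(c)) {                    // reads from the stream's internal buffer
        bool space = (c == ' ' || c == '\n' || c == '\t');
        if (!space && !in_word) ++words;
        in_word = !space;
    }
    std::cout << words << " words\n";
    return 0;
}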
Another option is to simply mmap the entire file and just do your logic on that. You might, or might not, notice a performance difference between mmap with and without the MAP_POPULATE flag. Again, you'll have to measure it and see.
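A sketch of the mmap approach, doing the same word count directly on the mapped bytes. MAP_POPULATE is a Linux-specific flag that asks the kernel to fault the whole mapping in up front; drop it to compare.

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // Map the whole file read-only; with MAP_POPULATE the kernel pre-faults it.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    char* data = static_cast<char*>(p);

    long words = 0;
    bool in_word = false;
    for (off_t i = 0; i < st.st_size; ++i) {   // the file looks like an in-memory array
        bool space = (data[i] == ' ' || data[i] == '\n' || data[i] == '\t');
        if (!space && !in_word) ++words;
        in_word = !space;
    }
    std::printf("%ld words\n", words);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}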
The most efficient method for I/O is to keep the data flowing.
That said, reading one block of 512 characters is faster than 512 reads of 1 character. Your system may have made optimizations, such as caches, to make reading faster, but you still have the overhead of all those function calls.
There are different methods to keep the I/O flowing:
Memory mapped file I/O
Double buffering
Platform Specific API
Some simple experiments should suffice for demonstration.
Create a vector or array of 1 megabyte.
Start a timer.
Repeat 1000 times:
Read data into container using 1 read instruction.
End the timer.
Repeat, using a for loop, reading 1,000,000 characters, with 1 read instruction each.
Compare your data.
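A sketch of that experiment using the C stream API; "big.bin" is a placeholder name for any file of at least 1 MB.

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t MB = 1 << 20;
    std::vector<char> buffer(MB);

    FILE* f = std::fopen("big.bin", "rb");   // placeholder: any file >= 1 MB
    if (!f) return 1;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i) {
        std::rewind(f);
        if (std::fread(buffer.data(), 1, MB, f) != MB) return 1;  // 1 call per megabyte
    }
    double block_reads = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();

    t0 = std::chrono::steady_clock::now();
    std::rewind(f);
    for (size_t i = 0; i < 1000 * 1000; ++i)
        if (std::fread(&buffer[i], 1, 1, f) != 1) break;          // 1 call per byte
    double byte_reads = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();

    std::printf("1000 reads of 1 MB:        %.3f s\n", block_reads);
    std::printf("1,000,000 reads of 1 byte: %.3f s\n", byte_reads);
    std::fclose(f);
    return 0;
}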
Details
For each request from the hard drive, the following steps are performed (depending on platform optimizations):
Start hard drive spinning.
Read filesystem directory.
Search directory for the filename.
Get logical position of the byte requested.
Seek to the given track & sector.
Read 1 or more sectors of data into hard drive memory.
Return the requested portion of hard drive memory to the platform.
Spin down the hard drive.
All of this is overhead, except for the step that actually reads the sectors of data.
The objective is to transfer as much data as possible while the hard drive is spinning. Starting a hard drive takes more time than keeping it spinning.

C++ Reading from several sections of a file is too slow

I need to read byte arrays from several locations of a big file.
I have already optimized the file so that as few sections as possible have to be read, and the sections are as closely together as possible.
I have 20 calls like this one:
m_content.resize(iByteCount);
fseek(iReadFile, iStartPos, SEEK_SET);
size_t readElements = fread(&m_content[0], sizeof(unsigned char), iByteCount, iReadFile);
iByteCount is around 5000 on average.
Before using fread, I used a memory-mapped file, but the results were approximately the same.
My calls are still too slow (around 200 ms) when called for the first time. When I repeat the same call with the same sections of bytes to read, it is very fast (around 1 ms), but that does not really help me.
The file is big (around 200 MB).
After this call, I have to read double values from a different section of the file, but I can not avoid this.
I don't want to split it up in 2 files. I have seen the "huge file approach" used by other people, too, and they overcame this problem somehow.
If I use memory-mapping, the first call of reading is always slow. If I then repeat reading from this section, it is lightning fast. When I then read from a different section, it is slow for the first time, but then lightning fast the second time.
I have no idea why this is so.
Does anybody have any more ideas for me?
Thank you.
Disk drives have two (actually three) factors that limit their speed: access time, sequential bandwidth, and bus latency/bandwidth.
What you feel most is access time. Access time is typically in the millisecond ballpark. Having to do a seek takes upwards of 5 (often more than 10) milliseconds on a typical hard disk. Note that the number printed on a disk drive is the "average" time, not the worst time (and, in some cases, it seems to be much closer to "best" than "average").
Sequential read bandwidth is typically upwards of 60-80 MiB/s even for a slow disk, and 120-150 MiB/s for a faster disk (or >400 MiB/s on solid state). Bus bandwidth and latency are something you usually don't care about, as bus speed usually exceeds the drive speed (except if you use a modern solid state disk on SATA-2, a 15k hard disk on SATA-1, or any disk over USB).
Also note that you cannot change the drive's bandwidth, nor the bus bandwidth. Nor can you change the seek time. However, you can change the number of seeks.
In practice, this means you must avoid seeks as much as you can. If that means reading in data that you do not need, do not be afraid of doing so. It is much faster to read 100 KiB than to read 5 KiB, seek ahead 90 KiB, and read another 5 KiB.
If you can, read the whole file in one go, and only use the parts you are interested in. 200 MiB should not be a big hindrance on a modern computer. Reading 200 MiB with fread into an allocated buffer might, however, be prohibitive (that depends on your target architecture and what else your program is doing). But don't worry, you have already found the best solution to the problem: memory mapping.
While memory mapping is not a "magic accelerator", it is nevertheless as close to "magic" as you can get.
The big advantage of memory mapping is that you can directly read from the buffer cache. Which means that the OS will prefetch pages, and you can even ask it to more aggressively prefetch, so effectively all your reads will be "instantaneous". Also, what is stored in the buffer cache is in some sense "free".
Unluckily, memory mapping is not always easy to get right (especially since the documentation and the hint flags typically supplied by operating systems are deceptive or counter-productive).
While you have no guarantee that what has been read once stays in the buffers, in practice this is the case for anything of "reasonable" size. Of course the operating system cannot and will not keep a terabyte of data in RAM, but something around 200 MiB will quite reliably stay in the buffers on a "normal" modern computer. Reading from buffers works more or less in zero time.
So, your goal is to get the operating system to read the file into its buffers, as sequentially as possible. Unless the machine runs out of physical memory so it is forced to discard buffer pages, this will be lightning fast (and if that happens, every other solution will be equally slow).
Linux has the readahead syscall which lets you prefetch data. Unluckily, it blocks until data has been fetched, which is probably not what you want (you would thus have to use an extra thread for this). madvise(MADV_WILLNEED) is a less reliable, but probably better alternative. posix_fadvise may work too, but note that Linux limits the readahead to twice the default readahead size (i.e. 256 KiB).
Do not let yourself be fooled by the docs, as the docs are deceptive. It may seem that MADV_RANDOM is a better choice, as your access is "random". It makes sense to be honest to the OS about what you're doing, doesn't it? Usually yes, but not here. This simply turns off prefetching, which is the exact opposite of what you really want. I don't know the rationale behind this, maybe some ill-advised attempt to conserve memory; in any case it is detrimental to your performance.
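A sketch of the MADV_WILLNEED hint on a mapping (error handling trimmed). The madvise call is only a hint and returns immediately; the readahead happens in the background.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    void* map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) return 1;

    // Ask the kernel to start pulling the whole mapping into the page cache.
    // This is a hint, not a guarantee, but later reads will mostly hit RAM.
    madvise(map, st.st_size, MADV_WILLNEED);

    // ... do the scattered reads against `map` here ...

    munmap(map, st.st_size);
    close(fd);
    return 0;
}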
Windows (since Windows 8, for desktop only) has PrefetchVirtualMemory which does exactly what one would want here, but unluckily it's only available on the newest version. On older versions, there is just... nothing.
A very easy, efficient, and portable way of populating the pages in your mapping is to launch a worker thread that faults every page. This sounds horrendous, but it works very nicely, and is operating-system agnostic.
Something like volatile int x = 0; for(int i = 0; i < len; i += 4096) x += map[i]; is entirely sufficient. I am using such code to pre-fault pages prior to accessing them; it works at speeds unrivalled by any other method of populating buffers and uses very little CPU.
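Fleshed out into a small helper, the pre-faulting worker thread might look like this sketch; map and len would come from an mmap call, and the volatile sink keeps the compiler from optimizing the touch loop away.

#include <cstddef>
#include <thread>

// Touch one byte per 4096-byte page in a background thread so the kernel
// faults the whole mapping in while the main thread keeps working.
std::thread prefault_in_background(const char* map, std::size_t len) {
    return std::thread([map, len] {
        volatile int sink = 0;
        for (std::size_t i = 0; i < len; i += 4096)
            sink += map[i];
        (void)sink;
    });
}

// Usage: auto t = prefault_in_background(data, size); ... t.join();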
(moved to an answer as requested by the OP)
You cannot read from a file any quicker (there is no magic flag to say "read faster"). Either there is an issue with your hardware, or 200 ms is how long it is supposed to take.
1) The difference in access speed between your first read and subsequent ones is perfectly understandable: your first call actually reads the file from the disk, and this takes time. However, your kernel (not to mention the disk controller) keeps the accessed data buffered, so when you access it a second time it is a pure memory access (1 ms).
Even if you only need to access really tiny portions of the file, libc/kernel/controller optimizations access the disk in quite large chunks. You can read the libc/OS/controller docs to try and align your reads on these chunks.
2) You're using stream input; try using the direct open/read/close functions instead: low-level I/O has less overhead (obviously). Nothing gets faster than this, so if you still find it too slow, you have an OS or hardware issue.
Since it looks like you have a good benchmark, try switching the size and the count in your fread call: reading 1 element of 1000 bytes will be faster than reading 1000 elements of 1 byte.
Disk is slow, and as you pointed out, the delay comes from the first access - that's the disk spinning up and accessing the sectors necessary. You're always going to pay that cost one time.
You could improve your performance a little by using memory mapped IO. See either mmap (Linux) or CreateFileMapping+MapViewOfFile (Windows).
I have already optimized the file so that as few sections as possible have to be read
Correct me if I'm wrong, but in reference to the file being optimised, I'm assuming you mean you've ordered the sections to minimize the number of reads that take place and not what I'm going to suggest.
Being bound by IO here is likely due to the seek times, so other than getting a faster storage medium, your options are limited.
Two possible ideas I had are:
1) Compress the data that is stored, which may give you slightly faster read times, but will still not help with seek time. You'd have to test if this benefits at all.
2) If relevant, as soon as you've retrieved one block of data, move it to a thread and start processing it while another read takes place. You may be doing this already, but if not, I thought it worth mentioning.

Understanding buffering behavior of fwrite()

I am using the function call fwrite() to write data to a pipe on Linux.
Earlier, fwrite() was being called for small chunks of data (average 20 bytes) repeatedly and buffering was left to fwrite(). strace on the process showed that 4096 bytes of data was being written at a time.
It turned out that this writing process was the bottleneck in my program. So I decided to buffer data in my code into blocks of 64KB and then write the entire block at a time using fwrite(). I used setvbuf() to set the FILE* pointer to 'No Buffering'.
The performance improvement was not as significant as I'd expected.
More importantly, the strace output showed that data was still being written 4096 bytes at a time. Can someone please explain this behavior to me? If I am calling fwrite() with 64KB of data, why is it writing only 4096 bytes at a time?
Is there an alternative to fwrite() for writing data to a pipe using a FILE* pointer?
The 4096 comes from the Linux machinery that underlies pipelines. There are two places it occurs. One is the capacity of the pipeline. The capacity is one system page on older versions of Linux, which is 4096 bytes on a 32 bit i386 machine. (On more modern versions of Linux the capacity is 64K.)
The other place you'll run into that 4096 bytes problem is in the defined constant PIPE_BUF, the number of bytes that are guaranteed to be treated atomically. On Linux this is 4096 bytes. What this limit means depends on whether you have set the pipeline to blocking or non-blocking. Do a man -S7 pipe for all the gory details.
If you are trying to exchange huge volumes of data at a high rate you might want to rethink your use of pipes. You're on a Linux box, so shared memory is an option. You can use pipes to send relatively small amounts of data as a signaling mechanism.
If you want to change the buffering behavior, you must do so immediately after the fopen (or before any I/O, for the standard filehandles stdin, stdout, stderr). You also do not want to disable buffering and try to manage the buffer yourself; rather, specify your 64K buffer to setvbuf so that it can be used properly.
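A sketch of that setup, assuming the pipe is reached through a FILE* opened on a placeholder path: hand the 64 KB buffer to stdio right after fopen instead of disabling buffering and chunking the data by hand.

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    std::vector<char> iobuf(64 * 1024);
    FILE* out = std::fopen("/tmp/my_fifo", "wb");   // placeholder path for the pipe
    if (!out) return 1;

    // Must happen before any I/O on the stream: full buffering with our 64 KB buffer.
    std::setvbuf(out, iobuf.data(), _IOFBF, iobuf.size());

    char record[20];
    std::memset(record, 'x', sizeof record);
    for (int i = 0; i < 100000; ++i)
        std::fwrite(record, 1, sizeof record, out); // stdio coalesces these into large write()s;
                                                    // the kernel may still split them at the
                                                    // pipe's capacity
    std::fclose(out);
    return 0;
}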
If you really want to manage the buffering manually, do not use stdio; use the lower level open, write, and close calls.

Writing data chunks while processing - is there a convergence value due to hardware constraints?

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).
I started writing the results straight away in files, one at a time, which was the slowest option. I figured it gets a lot faster if I build a vector of a certain amount of the files and then write them all at once, then go back to processing while the hard disk is occupied writing all that stuff that I poured into it (at least that seems to be what happens).
My question is: can I somehow estimate a convergence value for the amount of data that I should write, based on the hardware constraints? To me it seems to be a hard disk buffer thing; I have a 16 MB buffer on that hard disk and get these values (all for ~100,000 files):
Buffer size    time (minutes)
------------------------------
no buffer      ~ 8:30
1 MB           ~ 6:15
10 MB          ~ 5:45
50 MB          ~ 7:00
Or is this just a coincidence ?
I would also be interested in experience / rules of thumb about how writing performance is to be optimized in general, for example are larger hard disk blocks helpful, etc.
Edit:
Hardware is a pretty standard consumer drive (I'm a student, not a data center): a WD 3.5-inch 1 TB/7200 rpm/16 MB/USB 2 drive, HFS+ journaled; the OS is Mac OS 10.5. (I'll soon give it a try on Ext3/Linux and an internal disk rather than an external one.)
Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?
Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:
Which filesystem you're using
What disk-scheduling algorithm the kernel is using
The hardware characteristics of your disk
The hardware interconnect you're using
For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.
If I were you I'd take these two steps:
Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.
Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.
Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.
That being said, you should look at optimizing your read access. The OS (at least Windows) is already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't too much it can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.
You might want to use some dirty tricks. Writing 100,000s of files is not going to be efficient with the normal API.
Test this by writing sequentially to a single file first, not 100,000. Compare the performance. If the difference is interesting, read on.
If you really understand the file system you're writing to, you can make sure you're writing a contiguous block you just later split into multiple files in the directory structure.
You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.
[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.
Communication between the read thread and write helper(s) consists of a two-std::vector double-buffer per helper. (One buffer owned by the write process and one by the read process.) The read thread fills the buffer until a specified limit then blocks. The write thread times the write speed with gettimeofday or whatever, and adjusts the limit. If writing went faster than last time, increase the buffer by X%. If it went slower, adjust by –X%. X can be small.
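A sketch of that double-buffer scheme with one writer helper. The chunk contents, file names, batch size, and the ~10% adjustment step are all placeholders, and real code would need error handling.

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct Chunk { std::string path; std::vector<char> data; };

int main() {
    std::mutex m;
    std::condition_variable cv;
    std::vector<Chunk> filling, writing;   // the two halves of the double buffer
    std::size_t limit = 64;                // chunks per batch, adjusted at run time
    bool done = false;

    std::thread writer([&] {
        double last_rate = 0.0;
        for (;;) {
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return !filling.empty() || done; });
                if (filling.empty() && done) break;
                writing.swap(filling);     // take the full buffer, hand back the empty one
            }
            cv.notify_all();

            auto t0 = std::chrono::steady_clock::now();
            std::size_t bytes = 0;
            for (const Chunk& c : writing) {
                std::ofstream(c.path, std::ios::binary)
                    .write(c.data.data(), c.data.size());
                bytes += c.data.size();
            }
            double secs = std::chrono::duration<double>(
                              std::chrono::steady_clock::now() - t0).count();
            double rate = secs > 0 ? bytes / secs : 0;
            writing.clear();

            std::lock_guard<std::mutex> lock(m);
            // Faster than last batch: grow the batch ~10%; slower: shrink it ~10%.
            limit = rate >= last_rate ? limit + limit / 10 + 1 : limit - limit / 10;
            last_rate = rate;
        }
    });

    // "Processing" loop: produce chunks and block whenever the writer is behind.
    for (int i = 0; i < 1000; ++i) {
        Chunk c{ "out_" + std::to_string(i) + ".bin", std::vector<char>(5000, 'x') };
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return filling.size() < limit; });
        filling.push_back(std::move(c));
        if (filling.size() >= limit) cv.notify_all();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();
    writer.join();
    return 0;
}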