I want to save network bandwidth by using compression, such as bzip2 or gzip.
Attackers, as well as normal users, may send compressed messages.
Are there sequences of bytes which will cause some decompression functions to become stuck in an infinite loop, or to use vast amounts of memory?
If so, is this a fundamental property of those algorithms, or just an implementation bug?
I can only speak for zlib's inflate. There is no input that would result in an infinite loop or uncontrolled memory consumption.
Since the maximum compression ratio of deflate is a little less than 1032:1, inflate, when working normally, can expand its input by up to almost 1032:1. You just need to be able to handle that possibility.
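As one way of handling it, here is a minimal sketch (not from zlib's documentation; the 64 MB cap and the 16 KB chunk size are arbitrary numbers of mine) of inflating untrusted input with a hard limit on the decompressed size, so a near-1032:1 bomb fails cleanly instead of exhausting memory:

```cpp
#include <zlib.h>
#include <cstddef>
#include <vector>

// Inflate `in` into `out`, refusing to produce more than `max_out` bytes.
// Assumes a zlib-wrapped deflate stream; returns false on error or if the cap is hit.
bool inflate_with_limit(const unsigned char* in, size_t in_len,
                        std::vector<unsigned char>& out,
                        size_t max_out = 64 * 1024 * 1024)   // arbitrary 64 MB cap
{
    z_stream strm{};                     // zero-init: zalloc/zfree/opaque = Z_NULL
    if (inflateInit(&strm) != Z_OK) return false;

    strm.next_in  = const_cast<Bytef*>(in);
    strm.avail_in = static_cast<uInt>(in_len);

    unsigned char chunk[16384];
    int ret;
    do {
        strm.next_out  = chunk;
        strm.avail_out = sizeof(chunk);
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) { inflateEnd(&strm); return false; }

        size_t produced = sizeof(chunk) - strm.avail_out;
        if (out.size() + produced > max_out) { inflateEnd(&strm); return false; }  // enforce the cap
        out.insert(out.end(), chunk, chunk + produced);
    } while (ret != Z_STREAM_END);

    inflateEnd(&strm);
    return true;
}
```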
Related
I would like to increase the size of the sliding window for zlib beyond the maximum of 32 KB (I would like to match the window size to the length of the string that I am trying to compress). This is because I want to make sure that if a match exists it will be found. Can this be done easily? Or are there subtleties in the implementation that I should consider?
It would require a redesign of the deflate format, which inherently only allows distances of 32768 or less, and a rewrite of the deflate code in zlib.
The redesign of the deflate format was already done once, resulting in deflate64, which permits distances up to 65536 (maybe still not enough for you?); the zlib code could in principle be rewritten to accommodate that.
Alternatively, you can use other LZ compressors already written and tested with larger windows (often much larger windows), such as lzma or brotli.
Sometimes MPI is used to send low-entropy data in messages, so it can be useful to try to compress messages before sending them. I know that MPI can work on very fast networks (10 Gbit/s and more), but many MPI programs are used with cheap networks like 0.1 or 1 Gbit/s Ethernet and with cheap (slow, low-bisection) network switches. There is the very fast Snappy (wikipedia) compression algorithm, which has
Compression speed is 250 MB/s and decompression speed is 500 MB/s
so on compressible data and a slow network it will give some speedup.
Is there any MPI library which can compress MPI messages (at the MPI layer; not compression of IP packets as in PPP)?
MPI messages are also structured, so there could be special methods, like compressing the exponent part of an array of doubles.
PS: There is also the LZ4 compression method, with comparable speed.
I won't swear that there's none out there, but there's none in common use.
There are a couple of reasons why it's not common:
MPI is often used for sending lots of floating point data which is hard (but not impossible) to compress well, and often has relatively high entropy after a while.
In addition, MPI users are often as concerned with latency as bandwidth, and adding a compression/decompression step into the message-passing critical path wouldn't be attractive to those users.
Finally, some operations (like reduction collectives, or scatter-gather) would be very hard to implement efficiently with compression.
However, it sounds like your use case could benefit from this for point-to-point communications, so there's no reason why you couldn't do it yourself (a sketch follows the list below). If you were going to send a message of size N and the receiver expected it, then:
the sender calls the compression routine and gets back a buffer and a new length M;
if M >= N, it sends the original data, with an initial byte of (say) 0, as N+1 bytes to the receiver;
otherwise it sends an initial byte of 1 followed by the compressed data;
the receiver receives the data into a buffer of length N+1;
if the first byte is 1, it calls MPI_Get_count to determine the amount of data received and calls the decompression routine;
otherwise it uses the uncompressed data.
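Here is a rough sketch of that scheme using MPI point-to-point calls and zlib's one-shot compress()/uncompress() helpers; the function names, the tag handling, and the choice of zlib (rather than Snappy or LZ4) are illustrative assumptions, not part of any existing MPI library:

```cpp
#include <mpi.h>
#include <zlib.h>
#include <cstring>
#include <vector>

// A leading flag byte says whether the payload is compressed (1) or raw (0).
void send_maybe_compressed(const char* data, int n, int dest, int tag, MPI_Comm comm)
{
    uLongf m = compressBound(n);
    std::vector<char> buf(1 + m);
    int ok = compress(reinterpret_cast<Bytef*>(buf.data() + 1), &m,
                      reinterpret_cast<const Bytef*>(data), n);
    if (ok == Z_OK && static_cast<int>(m) < n) {
        buf[0] = 1;                                           // compressed payload
        MPI_Send(buf.data(), 1 + static_cast<int>(m), MPI_CHAR, dest, tag, comm);
    } else {
        buf[0] = 0;                                           // raw payload
        std::memcpy(buf.data() + 1, data, n);
        MPI_Send(buf.data(), 1 + n, MPI_CHAR, dest, tag, comm);
    }
}

// The receiver expects a message whose uncompressed size is n.
void recv_maybe_compressed(char* out, int n, int src, int tag, MPI_Comm comm)
{
    std::vector<char> buf(1 + n);                             // worst case: flag byte + raw data
    MPI_Status st;
    MPI_Recv(buf.data(), 1 + n, MPI_CHAR, src, tag, comm, &st);

    int count = 0;
    MPI_Get_count(&st, MPI_CHAR, &count);
    if (buf[0] == 1) {                                        // compressed: inflate into caller's buffer
        uLongf outlen = n;
        uncompress(reinterpret_cast<Bytef*>(out), &outlen,
                   reinterpret_cast<const Bytef*>(buf.data() + 1), count - 1);
    } else {                                                  // raw: copy past the flag byte
        std::memcpy(out, buf.data() + 1, count - 1);
    }
}
```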
I can't give you much guidance as to the compression routines, but it does look like people have tried this before, e.g. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.7936 .
I'll be happy to be told otherwise but I don't think many of us users of MPI are concerned with having a transport layer that compresses data.
Why the heck not?
1) We already design our programs to do as little communication as possible, so we (like to think we) are sending the bare minimum across the interconnect.
2) The bulk of our larger messages comprise arrays of floating-point numbers which are relatively difficult (and therefore relatively expensive in time) to compress to any degree.
There's an ongoing project at the University of Edinburgh: http://link.springer.com/chapter/10.1007%2F978-3-642-32820-6_72?LI=true
I have an image stream coming in from a camera at about 100 frames/second, with each image being about 2 MB. Now just because of the disk write speed I know I can't write each frame, so I'm only trying to save about a third of those frames each second.
The stream is a circular buffer of large char arrays. Right now I'm using fwrite to dump each array to a temporary file as it gets buffered, but it only seems to be writing at about 20-30 MB/s, while the hard drive should theoretically go up to 80-100 MB/s.
Any thoughts? Is there a faster way to write than fwrite() or a way to optimize it?
More generally what is the fastest way to dump large amounts of a data to a standard hard drive?
What if you use memory-mapped files limited to, for example, 1 GB each? This should provide enough speed and buffering to work with all the frames, especially if you manage to perform zero-copy frame allocation.
fwrite is buffered, which is what you want. Though with files/writes that big it shouldn't make much if any difference. Maybe experiment with a larger stream buffer via the setvbuf call.
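For example (a sketch, not a drop-in fix; the 4 MB buffer size is just an arbitrary starting point to experiment with):

```cpp
#include <cstdio>
#include <vector>

int main()
{
    std::FILE* f = std::fopen("frames.tmp", "wb");
    if (!f) return 1;

    // setvbuf must be called before the first read/write on the stream;
    // tune the size and measure.
    std::vector<char> iobuf(4 * 1024 * 1024);
    std::setvbuf(f, iobuf.data(), _IOFBF, iobuf.size());

    // ... fwrite() each frame here, as before ...

    std::fclose(f);   // iobuf must stay alive until the stream is closed
    return 0;
}
```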
Since you are limited by physical disk I/O speeds, as long as you are making it as easy as possible for the system to use each available disk I/O efficiently, there's not really much more you can do.
vmstat on Linux (and similar tools on other systems) can tell you how many disk I/Os your disk is doing, so you can test whether your changes help.
Asynchronous, non-buffered output is the key to success in your case. Buffered I/O will only cause double-buffering overhead, and synchronous I/O will make the HDD heads miss sequential sectors.
Boost.Asio provides a relatively good encapsulation of system-specific APIs for popular platforms.
There are a few things to remember (a minimal sketch follows the list):
on most non-Windows platforms you will have to write to raw partitions to get the system's buffering and internal threading out of the way.
keep the write queue non-empty all the time, so the SATA controller can help you by means of NCQ.
pay attention to the system-specific requirements on buffer alignment and size for async non-buffered I/O to work.
the file open mode is also important to make the system do what you want.
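As a minimal illustration of the alignment and non-buffered-open points only (it leaves out the async queueing entirely; it assumes Linux, O_DIRECT and a 4096-byte alignment, which you must check against your device):

```cpp
// Compile as C++ with g++ (which defines _GNU_SOURCE, needed for O_DIRECT).
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main()
{
    const size_t block = 4096;                 // assumed alignment/sector size; verify for your device
    void* buf = nullptr;
    if (posix_memalign(&buf, block, block) != 0) return 1;
    memset(buf, 0, block);

    int fd = open("raw.out", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { free(buf); return 1; }

    // With O_DIRECT, the buffer address, length and file offset must all be
    // multiples of the alignment size.
    ssize_t written = write(fd, buf, block);
    (void)written;

    close(fd);
    free(buf);
    return 0;
}
```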
I'm trying to send a lot of data (basically data records converted to a string) over a socket and it's slowing down the performance of the rest of my program. Is it possible to compress the data using gzip etc. and uncompress it at the other end?
Yes. The easiest way to implement this is to use the venerable zlib library.
The compress() and uncompress() utility functions may be what you're after.
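A minimal sketch of those two calls for a string payload (the framing on the wire is up to you; the receiver needs the original size back somehow, so you would send it along with the compressed bytes):

```cpp
#include <zlib.h>
#include <string>
#include <vector>

// Compress a string; send the returned buffer plus the original s.size()
// so the other end knows how big a buffer to uncompress into.
std::vector<unsigned char> pack(const std::string& s)
{
    uLongf len = compressBound(s.size());
    std::vector<unsigned char> out(len);
    compress(out.data(), &len, reinterpret_cast<const Bytef*>(s.data()), s.size());
    out.resize(len);                               // shrink to the actual compressed size
    return out;
}

// Rebuild the string on the receiving side.
std::string unpack(const unsigned char* data, size_t compressed_len, size_t original_len)
{
    std::string s(original_len, '\0');
    uLongf len = original_len;
    uncompress(reinterpret_cast<Bytef*>(&s[0]), &len, data, compressed_len);
    return s;
}
```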
Yes, but compression and decompression have their costs as well.
You might want to consider using another process or thread to handle the data transfer; this is probably harder than merely compressing, but will scale better when your data load increases n-fold.
Yes, it's possible. zlib is one library for doing this sort of compression and decompression. However, you may be better served by serializing your data records in a binary format rather than as a string; that should improve performance, possibly even more so than using compression.
Of course you can do that. When sending binary data, you have to take care of the endianness of the platform.
However, are you sure your performance problems will be solved by compressing the sent data? You'll still have additional steps (compression/decompression, possibly solving endianness issues).
Think about how the communication through the sockets is done. Are you using synchronous or asynchronous communication? If you do the reads and writes synchronously, then you may see performance penalties...
You may use AdOC, a library to transparently overload socket system calls:
http://www.labri.fr/perso/ejeannot/adoc/adoc.html
It does compression on the fly if it finds that it would be profitable.
Assuming the following for...
Output:
The file is opened...
Data is 'streamed' to disk. The data in memory is in a large contiguous buffer. It is written to disk in its raw form directly from that buffer. The size of the buffer is configurable, but fixed for the duration of the stream. Buffers are written to the file, one after another. No seek operations are conducted.
...the file is closed.
Input:
A large file (sequentially written as above) is read from disk from beginning to end.
Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?
Some possible considerations:
Guidelines for choosing the optimal buffer size
Will a portable library like boost::asio be too abstracted to expose the intricacies of a specific platform, or can it be assumed to be optimal?
Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?
I realize that this will have platform-specific considerations. I welcome general guidelines as well as those for particular platforms.
(my most immediate interest in Win x64, but I am interested in comments on Solaris and Linux as well)
Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?
Rule 0: Measure. Use all available profiling tools and get to know them. It's almost a commandment in programming that if you didn't measure it you don't know how fast it is, and for I/O this is even more true. Make sure to test under actual work conditions if you possibly can. A process that has no competition for the I/O system can be over-optimized, fine-tuned for conditions that don't exist under real loads.
Use mapped memory instead of writing to files. This isn't always faster but it allows the opportunity to optimize the I/O in an operating system-specific but relatively portable way, by avoiding unnecessary copying, and taking advantage of the OS's knowledge of how the disk is actually being used. ("Portable" if you use a wrapper, not an OS-specific API call).
Try and linearize your output as much as possible. Having to jump around memory to find the buffers to write can have noticeable effects under optimized conditions, because cache lines, paging and other memory subsystem issues will start to matter. If you have lots of buffers look into support for scatter-gather I/O which tries to do that linearizing for you.
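On POSIX systems that scatter-gather support is writev()/readv(); a rough sketch (the file name and buffer sizes are made up):

```cpp
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

int main()
{
    int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    char header[64]    = {0};   // stand-ins for buffers that live in different places
    char payload[4096] = {0};

    struct iovec iov[2];
    iov[0].iov_base = header;  iov[0].iov_len = sizeof(header);
    iov[1].iov_base = payload; iov[1].iov_len = sizeof(payload);

    writev(fd, iov, 2);         // one system call, both buffers, no extra copy
    close(fd);
    return 0;
}
```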
Some possible considerations:
Guidelines for choosing the optimal buffer size
Page size for starters, but be ready to tune from there.
Will a portable library like boost::asio be too abstracted to expose the intricacies of a specific platform, or can it be assumed to be optimal?
Don't assume it's optimal. It depends on how thoroughly the library gets exercised on your platform, and how much effort the developers put into making it fast. Having said that, a portable I/O library can be very fast, because fast abstractions exist on most systems, and it's usually possible to come up with a general API that covers a lot of the bases. Boost.Asio is, to the best of my limited knowledge, fairly fine-tuned for the particular platform it is on: there's a whole family of OS and OS-variant-specific APIs for fast async I/O (e.g. epoll, /dev/epoll, kqueue, Windows overlapped I/O), and Asio wraps them all.
Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?
Asynchronous I/O isn't faster in a raw sense than synchronous I/O. What asynchronous I/O does is ensure that your code is not wasting time waiting for the I/O to complete. It is faster in a general way than the other method of not wasting that time, namely using threads, because it will call back into your code when I/O is ready and not before. There are no false starts or concerns with idle threads needing to be terminated.
General advice is to turn off buffering and read/write in large chunks (but not too large, or you will waste too much time waiting for the whole I/O to complete when you could otherwise already start munching away at the first megabyte; it's easy to find the sweet spot, since there's only one knob to turn: the chunk size).
Beyond that, for input mmap()ing the file shared and read-only is (if not the fastest, then) the most efficient way. Call madvise() if your platform has it, to tell the kernel how you will traverse the file, so it can do readahead and throw out the pages afterwards again quickly.
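A bare-bones sketch of that input path on POSIX (error handling trimmed; the file name is made up):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("input.bin", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 1; }
    madvise(p, st.st_size, MADV_SEQUENTIAL);      // hint: sequential traversal, read ahead

    // ... walk the mapping from start to end ...
    const unsigned char* bytes = static_cast<const unsigned char*>(p);
    volatile unsigned long sum = 0;               // dummy work so the walk isn't optimized away
    for (off_t i = 0; i < st.st_size; ++i) sum += bytes[i];

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```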
For output, if you already have a buffer, consider underpinning it with a file (also with mmap()), so you don't have to copy the data in userspace.
If mmap() is not to your liking, then there's fadvise(), and, for the really tough ones, async file I/O.
(All of the above is POSIX, Windows names may be different).
For Windows, you'll want to make sure you use FILE_FLAG_SEQUENTIAL_SCAN in your CreateFile() call, if you opt to use the platform-specific Windows API. This will optimize caching for the I/O. As far as buffer sizes go, a buffer size that is a multiple of the disk sector size is typically advised; 8K is a nice starting point, with little to be gained from going larger.
This article discusses the comparison between async and sync on Windows.
http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx
As you noted above it all depends on the machine / system / libraries that you are using. A fast solution on one system may be slow on another.
A general guideline, though, would be to write in as large chunks as possible. Typically writing a byte at a time is the slowest.
The best way to know for sure is to code a few different ways and profile them.
You asked about C++, but it sounds like you're past that and ready to get a little platform-specific.
On Windows, FILE_FLAG_SEQUENTIAL_SCAN with a file mapping is probably the fastest way. In fact, your process can exit before the file actually makes it onto the disk. Without an explicitly blocking flush operation, it can take up to 5 minutes for Windows to begin writing those pages.
You need to be careful if the files are not on local devices but a network drive. Network errors will show up as SEH errors, which you will need to be prepared to handle.
On *nixes, you might get a bit higher performance writing sequentially to a raw disk device. This is possible on Windows too, but not as well supported by the APIs. This will avoid a little filesystem overhead, but it may not amount to enough to be useful.
Loosely speaking, RAM is 1000 or more times faster than disks, and CPU is faster still. There are probably not a lot of logical optimizations that will help, except avoiding movements of the disk heads (seek) whenever possible. A dedicated disk just for this file can help significantly here.
You will get the absolute fastest performance by using CreateFile and ReadFile. Open the file with FILE_FLAG_SEQUENTIAL_SCAN.
Read with a buffer size that is a power of two. Only benchmarking can determine this number. I have seen it to be 8K once. Another time I found it to be 8M! This varies wildly.
It depends on the size of the CPU cache, on the efficiency of OS read-ahead and on the overhead associated with doing many small writes.
Memory mapping is not the fastest way. It has more overhead because you can't control the block size and the OS needs to fault in all pages.
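For what it's worth, the CreateFile/ReadFile approach described above looks roughly like this; the 8 MB buffer is just one of the sizes you would benchmark, per the advice above:

```cpp
#include <windows.h>
#include <vector>

int main()
{
    HANDLE h = CreateFileA("input.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
    if (h == INVALID_HANDLE_VALUE) return 1;

    std::vector<char> buf(8 * 1024 * 1024);       // benchmark sizes from 8K up to 8M
    DWORD got = 0;
    while (ReadFile(h, buf.data(), static_cast<DWORD>(buf.size()), &got, nullptr) && got > 0) {
        // ... process `got` bytes ...
    }

    CloseHandle(h);
    return 0;
}
```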
On Linux, buffered reads and writes speed things up a lot, increasingly so with increasing buffer sizes, but the returns are diminishing and you generally want to use BUFSIZ (defined by stdio.h), as larger buffer sizes won't help much.
mmap()ing provides the fastest access to files, but the mmap call itself is rather expensive. For small files (16 KiB), read and write system calls win (see https://stackoverflow.com/a/39196499/1084774 for the numbers on reading through read and mmap).