When to compress HTTP POST data? - web-services

I am writing a client that frequently sends small portions of data via HTTPS. The data can be anything from 50 UTF-8 characters to 10k characters - mostly human-readable log data. I am using standard (RFC-compliant) HTTP compression.
I need to optimise for CPU consumption. I wonder if there is any threshold, something like: if a string is more than 100 chars, then it is worth doing compression.
Should I always apply compression to the HTTP payload, or only when it's worth doing?

In my opinion, the CPU overhead caused by compression on very small files is close to zero. So, I don't think it's worth doing a test on the file size.
The most worrying aspect of your description is actually that you apparently have many small requests over HTTPS. If you are not doing so already, I would recommend enabling SSL session caching in order to avoid too many SSL handshakes, which are likely to consume more CPU than compressing 100 bytes or so.
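As a rough illustration of that advice, here is a minimal Python sketch of a client that reuses one connection and only gzips bodies above a size threshold. The URL and the threshold value are illustrative assumptions, not values from the question, and the server would have to accept Content-Encoding: gzip request bodies.

# Minimal sketch, assuming a hypothetical log endpoint and an arbitrary
# size threshold; not a drop-in implementation of the asker's client.
import gzip
import requests

URL = "https://example.com/logs"   # hypothetical endpoint
THRESHOLD = 860                    # bytes; measure your own payloads to tune this

# Reusing one Session keeps the underlying TLS connection (and SSL session)
# alive across requests, which saves far more CPU than skipping compression.
session = requests.Session()

def post_log(text: str) -> requests.Response:
    body = text.encode("utf-8")
    headers = {"Content-Type": "text/plain; charset=utf-8"}
    if len(body) > THRESHOLD:
        body = gzip.compress(body)            # cheap for payloads this small
        headers["Content-Encoding"] = "gzip"  # server must be able to decode this
    return session.post(URL, data=body, headers=headers)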

Related

Could gzip compression cause data corruption?

I'm trying to come up with a solution to compress a few petabytes of data that will be stored in AWS S3. I was thinking of using gzip compression and was wondering whether compression could corrupt the data. I tried searching, but was not able to find any specific instances where gzip compression actually corrupted data such that it was no longer recoverable.
I'm not sure if this is the correct forum for such a question, but do I need to verify that the data was correctly compressed? Also, any specific examples/data points would help.
I would not recommend using gzip directly on a large block of data in one shot.
Many times I have compressed entire drives using something similar to
dd if=/dev/sda conv=sync,noerror | gzip > /media/backup/sda.gz
and the data was unusable when I tried to restore it. I have since reverted to not using compression.
gzip is constantly being used all around the world and has gathered a very strong reputation for reliability. But no software is perfect. Nor is any hardware, nor is S3. Whether you need to verify the data ultimately depends on your needs, but I think a hard disk failure is more likely than a gzip corruption at this point.
GZIP compression, like just about any other commonly-used data compression algorithm, is lossless. That means when you decompress the compressed data, you get back an exact copy of the original (and not something kinda sorta maybe like it, like JPEG does for images or MP3 for audio).
As long as you use a well-known program (like, say, gzip) to do the compression, are running on reliable hardware, and don't have malware on your machine, the chances of compression introducing data corruption are basically nil.
If you care about this data, then I would recommend compressing it and then comparing the decompressed result with the original before deleting the original. This checks for a bunch of possible problems, such as memory errors, mass storage errors, CPU errors, and transmission errors, as well as the least likely of all of these, a gzip bug.
Something like gzip -dc < petabytes.gz | cmp - petabytes in Unix would be a way to do it without having to store the original data again.
Also if loss of some of the data would still leave much of the remaining data useful, I would break it up into pieces so that if one part is lost, the rest is recoverable. Any part of a gzip file requires all of what precedes it to be available and correct in order to decompress that part.
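To make the "break it into pieces and verify" idea concrete, here is a Python sketch that writes independently decompressible gzip pieces and records a checksum for each. The file paths and the 1 GiB chunk size are illustrative assumptions, not part of the answer above.

# Sketch only: split data into independently compressed, independently
# verifiable gzip pieces. Paths and the chunk size are assumptions.
import gzip
import hashlib

CHUNK = 1 << 30  # 1 GiB of uncompressed data per piece

def split_and_compress(src_path, dst_prefix):
    # Writes dst_prefix.000.gz, dst_prefix.001.gz, ... and returns the
    # SHA-256 of each uncompressed chunk so every piece can be re-verified
    # after a round trip through storage.
    digests = []
    with open(src_path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(CHUNK)  # reads one piece into memory; fine for a sketch
            if not chunk:
                break
            with gzip.open(f"{dst_prefix}.{index:03d}.gz", "wb") as dst:
                dst.write(chunk)
            digests.append(hashlib.sha256(chunk).hexdigest())
            index += 1
    return digests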

When to use raw binary and when base64?

I want to develop a service which receives files from users. At first, I was planning to implement uploads using raw binary in order to save time (base64 increases file size by about 33%), but reading about base64, it seems to be very useful if you don't want problems uploading files.
The question is: what are the downsides of implementing raw binary uploads, and in which cases does it make sense? In this case I will be developing both the client and the server, so I have control over both ends, but what about routers or the network in between - can they corrupt data that isn't base64-encoded?
I'm trying to investigate what Dropbox or Google Drive do and why, but I can't find an article.
You won't have any problems using raw binary for file uploads. All Internet Protocol networking hardware is required to be 8-bit clean - that is, to transmit all 8 bits of every byte/octet.
If you choose to use the TCP protocol, it guarantees reliable transmission of octets (bytes). Encoding using base64 would be a waste of time and bandwidth.
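For what it's worth, a raw-binary upload is also simple to write. Here is a minimal Python sketch; the endpoint URL and the checksum header name are made up for illustration, not part of any standard.

# Sketch: upload a file as raw binary over HTTPS, no base64.
# The URL and the X-Content-SHA256 header name are hypothetical.
import hashlib
import requests

def upload(path, url="https://example.com/upload"):
    with open(path, "rb") as f:
        data = f.read()
    headers = {
        "Content-Type": "application/octet-stream",
        # Optional end-to-end integrity check: TCP already delivers the octets
        # intact, so this only guards against bugs in the application layer.
        "X-Content-SHA256": hashlib.sha256(data).hexdigest(),
    }
    return requests.post(url, data=data, headers=headers)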

Is there any compression available in libcurl

I need to transfer a huge file from a local machine to a remote machine using libcurl with C++. Is there any compression option built into libcurl? As the data to be transferred is large (100 MB to 1 GB in size), it would be better if such an option were available in libcurl itself. I know we can compress the data ourselves and send it via libcurl, but I just want to know whether there is a better way of doing so.
Note: In my case, many client machines transfer such huge data to remote server at regular interval of time.
thanks,
Prabu
According to the curl_setopt() documentation for the CURLOPT_ENCODING option, you may specify:
The contents of the "Accept-Encoding: " header. This enables decoding of the response. Supported encodings are "identity", "deflate", and "gzip". If an empty string, "", is set, a header containing all supported encoding types is sent.
Here are some examples (just hit search in your browser and type in "compression"), but I don't know how exactly it works and whether it expects already-gzipped data.
You may still use gzcompress() and send compressed chunks on your own (and I would do the task this way... you'll have better control over what's actually going on, and you'll be able to change the algorithm used).
You need to compress your file with zlib and send it yourself. And perhaps some modifications are needed on the server side.
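The "compress it yourself" approach the answers describe might look like the following Python sketch. The endpoint URL is an illustrative assumption, the server must be set up to decompress the body, and with libcurl the same pre-compressed bytes would simply be handed to the upload options instead.

# Sketch of client-side pre-compression: gzip the file, label the body with
# Content-Encoding, and let a suitably configured server decompress it.
# The endpoint URL is a placeholder.
import gzip
import requests

def upload_compressed(path, url="https://example.com/ingest"):
    with open(path, "rb") as f:
        raw = f.read()
    body = gzip.compress(raw)  # zlib-backed gzip, default compression level
    headers = {
        "Content-Encoding": "gzip",
        "Content-Type": "application/octet-stream",
    }
    return requests.post(url, data=body, headers=headers)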

Redis is slow to get large strings

I'm kind of a newb with Redis, so I apologize if this is a stupid question.
I'm using Django with Redis as a cache.
I'm pickling a collection of ~200 objects and storing it in Redis.
When I request the collection from Redis, Django Debug Toolbar is informing me that the request to Redis is taking ~3 seconds. I must be doing something horribly wrong.
The server has 3.5GB of ram, and it looks like Redis is currently using only ~50mb, so I'm pretty sure it's not running out of memory.
When I get the key using redis-cli, it takes just as long as when I do it from Django.
Running STRLEN on the key from redis-cli, I'm informed that the length is ~20 million (is this too large?).
What can I do to have Redis return the data faster? If this seems unusual, what might be some common pitfalls? I've seen this page on latency problems, but nothing has really jumped out at me yet.
I'm not sure if it's a really bad idea to store a large amount of data in one key, or if there's just something wrong with my configuration. Any help or suggestions or things to read would be greatly appreciated.
Redis is not designed to store very large objects. You are not supposed to store your entire collection in a single string in Redis, but rather use a Redis list or set as a container for your objects.
Furthermore, the pickle format is not optimized for space... you would need a more compact format. Protocol Buffers, MessagePack, or even plain JSON are probably better for this. You should also consider applying a lightweight compression algorithm before storing your data (like Snappy, LZO, QuickLZ, LZF, etc.).
Finally, the performance is probably network bound. On my machine, retrieving a 20 MB object from Redis takes 85 ms (not 3 seconds). Now, if I run the same test using a remote server, it takes 1.781 seconds, which is expected on this 100 Mbit/s network. The duration is fully dependent on the network bandwidth.
Last point: be sure to use a recent Redis version - a number of optimizations have been made to deal with large objects.
It's most likely just the size of the string. I'd look at whether your objects are being serialized efficiently.
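Following the first answer's advice, a light compression step in front of Redis is only a few lines. Here is a sketch using zlib from the standard library as a stand-in for Snappy/LZ4 (which need extra packages but are faster); the key name and connection details are assumptions.

# Sketch: pickle + lightweight compression before writing to Redis.
# zlib level 1 stands in for Snappy/LZ4; adjust the connection to your setup.
import pickle
import zlib
import redis

r = redis.Redis(host="localhost", port=6379)  # hypothetical local instance

def cache_collection(key, objects):
    blob = zlib.compress(pickle.dumps(objects, pickle.HIGHEST_PROTOCOL), 1)
    r.set(key, blob)

def load_collection(key):
    blob = r.get(key)
    return pickle.loads(zlib.decompress(blob)) if blob is not None else None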

How much does gzipping burden the client?

I'm optimizing our web service, and heard about gzip.
It would be good if we could reduce the network load using gzip, but I'm a little worried about how much unpacking overhead it will bring to the client.
In particular, our service uses JavaScript heavily, which means that page rendering in the web browser already costs CPU time.
I can't be sure that spending CPU time decompressing gzip payloads (instead of running JavaScript) would still have a net positive effect on our service.
Things like HTML and JavaScript libraries, particularly static files, are good candidates for compression. Images aren't - they're already compressed.
Decompression of gzip-compressed data is very fast compared to most internet connections - a quick test on my PC (AMD Phenom, 2.8 GHz) decompresses about 170 MB/second on a single core. So a ~200 KB JavaScript file would be decompressed by a modern browser on a modern PC in a millisecond or two, and JavaScript typically compresses to about 25% of its original size (~35% if it is already minified).
Of course, just what proportion of your network load is made up of compressible JavaScript is another matter.
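If you want a number for your own assets rather than the answerer's machine, a quick measurement is easy. This Python sketch assumes a hypothetical app.min.js bundle on disk; the results will vary by machine and file.

# Sketch: measure compression ratio and decompression time for one asset.
# "app.min.js" is a placeholder file name.
import gzip
import time

with open("app.min.js", "rb") as f:
    original = f.read()

compressed = gzip.compress(original)
print(f"compressed to {len(compressed) / len(original):.1%} of original size")

start = time.perf_counter()
gzip.decompress(compressed)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"decompressed {len(original) / 1e6:.2f} MB in {elapsed_ms:.2f} ms")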