When to use raw binary and when base64? - web-services

I want to develop a service which receives files from users. At first, I was willing to implement uploads using raw binary in order to save time (base64 increases file size by about 33%), but reading about base64 it seems to be very useful if you don't want problems uploading files.
The question is what are the downsides of implementing raw binary uploads? And in which cases it makes sense? In this case I will develop client and server so I will have control over these two, but what about routers or network, can they corrupt data if not in base64?
I'm trying to investigate what dropbox or google drive do and why, but I can't find an article.

You won't have any problems using raw binary for file uploads. All Internet Protocol networking hardware is required to be 8 bit clean - that is to transmit all 8 bits of every byte/octet.
If you choose to use the TCP protocol, it guarantees reliable transmission of octets (bytes). Encoding using base64 would be a waste of time and bandwidth.

Related

Streaming File Delta Encoding/Decoding

Here's the problem - I want to generate the delta of a binary file (> 1 MB in size) on a server and send the delta to a memory-constrained (low on RAM and no dynamic memory) embedded device over HTTP. Deltas are preferred (as opposed to sending the full binary file from the server) because of the high cost involved in transmitting data over the wire.
Trouble is, the embedded device cannot decode deltas and create the contents of the new file in memory. I have looked into various binary delta encoding/decoding algorithms like bsdiff, VCDiff etc. but was unable to find libraries that supported streaming.
Perhaps, rather than asking if there are suitable libraries out there, are there alternate approaches I can take that will still solve the original problem (send minimal data over the wire)? Although it would certainly help if there are suitable delta libraries out there that support streaming decode (written in C or C++ without using dynamic memory).
Maintain a copy on the server of the current file as held by the embedded device. When you want to send an update, XOR the new version of the file with the old version and compress the resultant stream with any sensible compressor. (Algorithms which allow high-cost encoding to allow low-cost decoding would be particularly helpful here.) Send the compressed stream to the embedded device, which reads the stream, decompresses it on the fly and XORs directly (a copy of) the target file.
If your updates are such that the file content changes little over time and retains a fixed structure, the XOR stream will be predominantly zeroes, and will compress extremely well: number of bytes transmitted will be small, effort to decompress will be low, memory requirements on the embedded device will be minimal. The further your model is from these assumptions, the less this approach will gain you.
Since you said the delta could be arbitrarily random (from zero delta to a completely different file), compression of the delta may be a lost cause. Lossless compression of random binary data is theoretically impossible. Also, since the embedded device has limited memory anyway, using a sophisticated -and therefore computationally expensive- library for compression/decompression of the occasional "simple" delta will probably be infeasible.
I would recommend simply sending the new file to the device in raw byte format, and overwriting the existing old file.
As Kevin mentioned, compressing random data should not be your goal. A few more comments about the type of data your working with would be helpful. Context is key in compression.
You used the term image which makes it sound like the classic video codec challenge. If you've ever seen weird video aliasing effects that impact the portion of the frame that has changed, and then suddenly everything clears up. You've likely witnessed the notion of a key frame along with a series of delta frames. Where the delta frames were not properly applied.
In this model, the server decides what's cheaper:
complete key frame
delta commands
The delta commands are communicated as a series of write instructions that can overlay the clients existing buffer.
Example Format:
[Address][Length][Repeat][Delta Payload]
[Address][Length][Repeat][Delta Payload]
[Address][Length][Repeat][Delta Payload]
There are likely a variety of methods for computing these delta commands. A brute force method would be:
Perform Smith Waterman between two images.
Compress the resulting transform into delta commands.

Is there any compression available in libcurl

I need to transfer a huge file from local machine to remote machine using libcurl with C++. Is there any compression option available in-built with libcurl. As the data to be transferred is large (100 MB to 1 GB in size), it would be better if we have any such options available in libcurl itself. I know we can compress the data and send it via libcurl. But just want to know is there any better way of doing so.
Note: In my case, many client machines transfer such huge data to remote server at regular interval of time.
thanks,
Prabu
According to curl_setopt() and options CURLOPT_ENCODING, you may specify:
The contents of the "Accept-Encoding: " header. This enables decoding
of the response. Supported encodings are "identity", "deflate", and
"gzip". If an empty string, "", is set, a header containing all
supported encoding types is sent.
Here are some examples (just hit search in your browser and type in compression), but I don't know hot exactly does it work and whether it expect already gzipped data.
You still may use gzcompress() and send compressed chunks on your own (and I would do the task this way... you'll have better control on what's actually going on and you'll be able to change used algorithms).
You need to send your file with zlib compression by yourself. And perhaps there are some modification needed on the server-side.

how much gzipping burdens the client?

I'm optimizing our web service, and heard about gzip.
It would be good if we can reduce the network load using gzip, but I'm a little worried about how much unpacking overhead it'll bring to client.
Especially, our service uses javascript very often - which means that page rending in web browser will cost CPU time.
I cannot sure that taking cpu time to decompress gzip packet (instead of taking care of javascript) would bring positive effect to our service still.
Things like HTML and javascript libraries, particularly static files, are good candidates for compression. images aren't - they're already compressed.
Decompression of gzip compressed data is very fast compared to most internet connections - a quick test on my PC (AMD phenom 2.8GHz) results in decompression of about 170m/second, in a single core. So a ~200k javascript file would be decompressed by a modern browser on a modern PC in about 2 milliseconds, and javascript typically compresses to about 25% of its original size (~35% if it is already minified).
Of course, just what proportion of your network load is made up of decompressed javascript is another matter.

Transferring large files with web services

What is the best way to transfer large files with web services ? Presently we are using the straight forward option to transfer the binary data by converting the binary data into base 64 format and embeding the base 64 encoding into soap envelop itself.But it slows down the application performance considerably.Please suggest something for performance improvement.
In my opinion the best way to do this is to not do this!
The Idea of Webservices is not designed to transfer large files. You should really transfer an url to the file and let the receiver of the message pull the file itsself.
IMHO that would be a better way to do this then encoding and sending it.
Check out MTOM, a W3C standard designed to transfer binary files through SOAP.
From Wikipedia:
MTOM provides a way to send the binary
data in its original binary form,
avoiding any increase in size due to
encoding it in text.
Related resources:
SOAP Message Transmission Optimization Mechanism
Message Transmission Optimization Mechanism (Wikipedia)

Multi-part gzip file random access (in Java)

This may fall in the realm of "not really feasible" or "not really worth the effort" but here goes.
I'm trying to randomly access records stored inside a multi-part gzip file. Specifically, the files I'm interested in are compressed Heretrix Arc files. (In case you aren't familiar with multi-part gzip files, the gzip spec allows multiple gzip streams to be concatenated in a single gzip file. They do not share any dictionary information, it is simple binary appending.)
I'm thinking it should be possible to do this by seeking to a certain offset within the file, then scan for the gzip magic header bytes (i.e. 0x1f8b, as per the RFC), and attempt to read the gzip stream from the following bytes. The problem with this approach is that those same bytes can appear inside the actual data as well, so seeking for those bytes can lead to an invalid position to start reading a gzip stream from. Is there a better way to handle random access, given that the record offsets aren't known a priori?
The BGZF file format, compatible with GZIP was developped by the biologists.
(...) The advantage of
BGZF over conventional gzip is that
BGZF allows for seeking without having
to scan through the entire file up to
the position being sought.
In http://picard.svn.sourceforge.net/viewvc/picard/trunk/src/java/net/sf/samtools/util/ , have a look at BlockCompressedOutputStream and BlockCompressedInputStream.java
The design of GZIP, as you have realized, is not friendly to random access.
You can do as you describe, and then if you run into an error in the decompressor, conclude that the signature you found was actually compressed data.
If you finish decompressing, then it's easy to verify the validity of the stream just decompressed, via the CRC32.
If the files are not so big, you might consider just de-compressing all of the entries in series, and retaining the offsets of the signatures so as to build a directory. As you decompress, dump the bytes to a bit bucket. At that point you will have generated a directory, and you can then support random access based on filename, date, or other metadata.
This will be reasonably fast for files below 100k. Just as a guess, if you had 10 files of around 100k each, it would probably be done in 2s on a modern CPU. This is what I mean by "pretty fast". But only you know the perf requirements of your application .
Do you have a GZipInputStream class? If so you are half-way there.