Downloading gzipped content with libcurl without decompressing it

I want to retrieve web pages in their gzipped form, but instead of decompressing them I'd like to write the gzipped bytes straight to a file stream. Is this possible? If I don't send an Accept-Encoding header for gzip, I get the plain HTML content, but if I do send it, libcurl appears to decompress the response automatically. Can I retrieve the original compressed bytes without the overhead of decompressing?

If you pass in Accept-Encoding: gzip as a custom header with CURLOPT_HTTPHEADER, libcurl won't decompress it automatically. It will only automatically decompress gzip if you set CURLOPT_ACCEPT_ENCODING.

Related

C++ Compress string using zlib, naming of document

I'm using zlib to compress a stream of text into a .gz (gzip) file, and it's working well. However, the file inside the gzip seems to get exactly the same name as my .gz file.
Is there any way to change the name of the file that has been compressed?
I would rather it name like the following:
/myfile.gz/myfile
Where myfile is the document that's inside of the compressed gzip file, and myfile.gz is the gzipped file itself.
Is there any way to control these namings?
I think what you're saying is that when you decompress whatever.gz, you get a file named whatever in the current directory. That is the default behavior of the gzip utility, and it is not affected by how the gzip file is made. The contents of the gzip file cannot direct the decompressed data to some other directory. (If it could, that would be a security issue.)
It is possible to store a file name in the gzip header (with zlib, through the name field of the gz_header structure passed to deflateSetHeader()), in which case gzip -N whatever.gz will decompress to the name in the header instead of whatever. However, the result will still be a file in the current directory that uses just the base name from the header. Any path information in the file name in the gzip header is ignored.
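To make the header name concrete, here is a minimal sketch in Python rather than C++, using the standard gzip module instead of calling zlib directly. The helper names gzip_with_name and read_stored_name are hypothetical and exist only for illustration. The first stores a name in the optional FNAME field of the gzip header, and the second reads it back by walking the header layout from RFC 1952.

```python
import gzip
import io


def gzip_with_name(payload: bytes, stored_name: str) -> bytes:
    """Compress payload and record stored_name in the gzip header (FNAME field)."""
    buf = io.BytesIO()
    # GzipFile writes only the base name of `filename` into the header.
    with gzip.GzipFile(filename=stored_name, mode="wb", fileobj=buf) as gz:
        gz.write(payload)
    return buf.getvalue()


def read_stored_name(blob: bytes):
    """Return the file name stored in a gzip header, or None if there is none."""
    if blob[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = blob[3]
    pos = 10  # magic(2) + method(1) + flags(1) + mtime(4) + xfl(1) + os(1)
    if flags & 0x04:  # FEXTRA: skip the optional extra field
        extra_len = int.from_bytes(blob[pos:pos + 2], "little")
        pos += 2 + extra_len
    if not flags & 0x08:  # FNAME bit not set, so no name is stored
        return None
    end = blob.index(b"\x00", pos)  # the name is NUL-terminated
    return blob[pos:end].decode("latin-1")


blob = gzip_with_name(b"hello\n", "myfile")
print(read_stored_name(blob))  # prints: myfile
```

Writing the bytes to anything.gz and running gzip -N -d on it should then produce a file called myfile in the current directory, whatever the .gz file itself is called.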

Requests issue decoding gzip

I'm trying to pull a large number of text files from a website using the requests package where some of the files are available outright as text and others are compressed text files.
tmpHtml = 'https://website.com/csv/pwr/someData.dat.gz'
tmpReq = requests.get(tmpHtml, proxies = proxy_w_auth, auth = (usr, pw))
When I pull the uncompressed files, everything works well. However, when I pull one of the compressed files, I get lots of the following:
'\x1f\x8b\x08\x08\xe5\xc6\xd9A\x00\x03someData.dat\x00\xa5\x9d\xcbn\x1c\xb9\x19\x85\xf7\x01\xf2\x0e\xfd\x00Q\xa9X,^j\xa9\xc8\x9a\xb1\x9dX\x16dM\x12/\r\x8c\x0712\x19\x0f\xb2\t\x02\xf4\xc3\xa7\xba\xeeM\x9e\x9f<\xa46s\x93\xf1\r\x8b\xfd\x7fl\x9e\xe2E/\xcfwo\x1eNo\xee^\x1e\xceo\x7f\xfa\xf3\xf9\xe9\xf9\xe3\x9b\x9f\xee_\xce\x9f^\x9e\xdf=\x9d\xef?>\xbe<\xdf\x8d\xff\xba\xfe\xc3\xe9\xe5\xf3\xd3\xc3\xf4\xc3\xbf\x8c\x7f{xy\xf9\xeb\xc3\x87\x87\xc7\x97\xd3\xd3\xf3\xbb\xfb\x87\xf3\xe3\xc3\xcb\xe9\xfe\xed\xdd\xe3\x8f\x0f\xe7\x87\x7f<\xbd{\xbe{y\xf7\xf1qb\xff\xf1\x0f\xeaV\xdfvmk\xce\xf7\xdf~;\xff\xf0\xed\xb7\xd3\xa7\xff~\xf9\xfd\xe6\xe9\xeb\x97\x7f\xfd\xe9\xf4\xc3\xd3\xe9\x97\xef\xff9]\x10\xeaV-\x7f\xec\xdd\xe3\xf9\x87\xf3\xb9W\x8d\xf6\xe7\x1b\xd3\xf4n\xfc\x99\x9e\x7fH\xd3\xba\x90f\x1ak\xce7\xbaQ\xe3\x8f:_\x06\xd31ldu\xe3_tq\xc3z\x91\xd5\xdfvC\x19\xcb\x84,\xdd\xb8\x11\xa6\x9a\xce\x8c?+m\x99\ri\xf6\xc2\xb9i\xc7\xa6\xd9[\xdd\x96\xc1\\\x003vn\xda\xf8\x83\xd2\xa7\xf4\x12\xca\x17?\xe2\x10u\xd8\xe5\xf9\xc6\xa7\x1c\x8a\x1fP\xb5
I can see the file name in the beginning of the string that is returned but I'm not sure how I can actually extract the content. According to the requests documentation, it should automatically be decompressing gz files?
http://requests.readthedocs.org/en/latest/community/faq/
The response object looks like it has gzip in the headers as well:
{'Accept': '*/*', 'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'python-requests/2.7.0 CPython/2.7.10 Windows/7'}
Any suggestions would be much appreciated.
Sometimes web clients ask the server to compress a response before sending it. This does not apply to .gz files, since you wouldn't compress something twice. Compression in transit cuts down the transfer size, especially for large text files, and the client then decompresses the response automatically before handing it to the user. This is what the requests docs linked in your question describe, and it is not what is happening in your case: the .gz file is the content itself, so requests returns its bytes unchanged.
To decompress a gzipped file, you have to either decompress it in memory using the gzip module (part of the standard library) or write it to disk in 'wb' mode and use the gzip utility.
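The in-memory route could look roughly like the sketch below. The helper name body_as_text and the UTF-8 default are assumptions for illustration. The sample bytes stand in for tmpReq.content from the question, so the snippet runs without network access.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def body_as_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a downloaded body, gunzipping it first if it is a gzip file."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Stand-in for tmpReq.content; a real download would supply these bytes.
sample = gzip.compress(b"col1,col2\n1,2\n")
print(body_as_text(sample))
```

Checking the magic bytes lets the same function handle both the plain text files and the .gz files mentioned in the question.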

How to get the content-type of a file

I am implementing an HTTP/1.0 server that processes GET and HEAD requests.
I've finished Date, Last-Modified, and Content-Length, but I don't know how to get the Content-Type of a file.
It has to return directory for a directory (which I can detect using the stat() function), and for a regular file, text/html for a text or HTML file and image/gif for an image or GIF file.
Should this be hard-coded, using the name of the file?
I wonder if there is any function to get this Content-Type.
You could either look at the file extension (which is what most web servers do; see e.g. the /etc/mime.types file), or you could use libmagic to determine the content type automatically by looking at the first few bytes of the file.
It depends how sophisticated you want to be.
If the files in question are all properly named and there are only a few types to handle, a switch based on the file suffix is sufficient. At the other extreme, making the right decision no matter what the file is would probably require either duplicating the functionality of the Unix file command or running it on the file in question (and then translating the output to the proper Content-Type).
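The suffix-based approach can be sketched as a small lookup table. The sketch is in Python for brevity, although the server in the question is presumably written in C. The table contents follow the mapping described in the question, while the fallback type application/octet-stream and the function name content_type_for are assumptions.

```python
import os

# Deliberately tiny, hand-written table; a real server would likely load
# something like /etc/mime.types instead.
CONTENT_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".txt": "text/html",  # the question maps text files to text/html
    ".gif": "image/gif",
}
DEFAULT_TYPE = "application/octet-stream"  # assumed fallback for unknown suffixes


def content_type_for(path: str) -> str:
    """Pick a Content-Type from the path: 'directory' for directories, else by suffix."""
    if os.path.isdir(path):
        return "directory"
    suffix = os.path.splitext(path)[1].lower()
    return CONTENT_TYPES.get(suffix, DEFAULT_TYPE)


for name in ("index.html", "logo.GIF", "archive.bin"):
    print(name, content_type_for(name))
```

The same structure translates directly to C as a static array of suffix/type pairs searched with strcmp() after stat() has ruled out a directory.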

When will NSURLConnection decompress a compressed resource?

I've read how NSURLConnection will automatically decompress a compressed (zipped) resource, however I can not find Apple documentation or official word anywhere that specifies the logic that defines when this decompression occurs. I'm also curious to know how this would relate to streamed data.
The Problem
I have a server that streams files to my app using, I believe, chunked encoding. This is a WCF service. We're going with streaming because it should alleviate server load during high use and because our files are going to be very large (hundreds of MB). The files could be compressed or uncompressed. I think that because we're streaming the data, neither the Content-Encoding header nor Content-Length is available. I only see "Transfer-Encoding" = Identity in my response.
I am using the AFNetworking library to write these files to disk with AFHTTPRequestOperation's inputStream and outputStream. I have also tried using AFDownloadRequestOperation as well with similar results.
Now, the AFNetworking docs state that compressed files will automatically be decompressed (via NSURLConnection, I believe) after download and this is not happening. I write them to my documents directory, with no problems. Yet they are still zipped. I can unzip them manually, as well. So the file is not corrupted. Do they not auto-unzip because I'm streaming the data and because Content-Encoding is not specified?
What I'd like to know:
Why are my compressed files not decompressing automatically? Is it because of streaming? I know I could use another library to decompress afterward, but I'd like to avoid that if possible.
When exactly does NSURLConnection know when to decompress a downloaded file, automatically? I can't find this in the docs anywhere. Is this tied to a header value?
Any help would be greatly appreciated.
NSURLConnection will decompress automatically when the appropriate Content-Encoding (e.g. gzip) is available in the response header. That's down to your server to arrange.
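The rule can be illustrated with a small header-driven decoder. This is a Python sketch of the general HTTP client behaviour, not NSURLConnection's actual implementation, and the function name decode_body is hypothetical.

```python
import gzip
import zlib


def decode_body(headers: dict, body: bytes) -> bytes:
    """Undo the coding named in Content-Encoding; pass the body through otherwise."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    coding = lowered.get("content-encoding", "identity").strip().lower()
    if coding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if coding == "deflate":
        return zlib.decompress(body)
    return body  # no (or unknown) coding: bytes are delivered as received


compressed = gzip.compress(b"payload")
# With the header present, the client sees the original bytes.
print(decode_body({"Content-Encoding": "gzip"}, compressed))
# With only Transfer-Encoding, as in the question, the bytes stay compressed.
print(decode_body({"Transfer-Encoding": "Identity"}, compressed) == compressed)
```

In the situation described in the question, the response carries no Content-Encoding header, so a client following this rule has no reason to decompress anything and the file lands on disk still compressed.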

Is there any compression available in libcurl

I need to transfer a huge file from a local machine to a remote machine using libcurl with C++. Is there any compression option built into libcurl? As the data to be transferred is large (100 MB to 1 GB in size), it would be better if such an option were available in libcurl itself. I know we can compress the data ourselves and send it via libcurl, but I want to know whether there is a better way of doing so.
Note: In my case, many client machines transfer such huge data to remote server at regular interval of time.
thanks,
Prabu
According to the PHP documentation for curl_setopt(), the CURLOPT_ENCODING option lets you specify:
The contents of the "Accept-Encoding: " header. This enables decoding
of the response. Supported encodings are "identity", "deflate", and
"gzip". If an empty string, "", is set, a header containing all
supported encoding types is sent.
Here are some examples (just hit search in your browser and type in compression), but I don't know exactly how it works or whether it expects data that is already gzipped.
You can still use gzcompress() and send compressed chunks on your own (this is how I would do it: you'll have better control over what's actually going on, and you'll be able to change the algorithm used).
You need to compress your file with zlib yourself before sending it. Some modifications may also be needed on the server side so that it can decompress what it receives.
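Compressing the data yourself, chunk by chunk, could look like the following sketch. It is written in Python for brevity, but the zlib calls map onto the C API (deflateInit2() with windowBits of 16 + MAX_WBITS, then deflate()). The generator name gzip_chunks and the chunk size are assumptions, and how the chunks are handed to libcurl (for example through a read callback) is left out.

```python
import gzip
import io
import zlib


def gzip_chunks(fileobj, chunk_size=64 * 1024, level=6):
    """Yield gzip-compressed chunks from fileobj without loading it all into memory."""
    # wbits = 16 + MAX_WBITS asks zlib for a gzip wrapper instead of a zlib one.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        out = compressor.compress(chunk)
        if out:  # the compressor may buffer input and return nothing yet
            yield out
    yield compressor.flush()  # remaining data plus the gzip trailer


# Stand-in for a large file on disk.
source = io.BytesIO(b"some repetitive log line\n" * 50000)
upload_body = b"".join(gzip_chunks(source, chunk_size=4096))
print(len(upload_body) < len(source.getvalue()))
```

Because the stream is produced incrementally, a 1 GB file never has to sit in memory in full. The receiving server still has to be told that the body is gzipped and has to decompress it, which is the server-side change mentioned above.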