Requests issue decoding gzip - python-2.7

I'm trying to pull a large number of text files from a website using the requests package. Some of the files are available outright as text, while others are compressed text files.
tmpHtml = 'https://website.com/csv/pwr/someData.dat.gz'
tmpReq = requests.get(tmpHtml, proxies = proxy_w_auth, auth = (usr, pw))
When I pull the uncompressed files everything works well; however, when I pull one of the compressed files I get lots of the following:
'\x1f\x8b\x08\x08\xe5\xc6\xd9A\x00\x03someData.dat\x00\xa5\x9d\xcbn\x1c\xb9\x19\x85\xf7\x01\xf2\x0e\xfd\x00Q\xa9X,^j\xa9\xc8\x9a\xb1\x9dX\x16dM\x12/\r\x8c\x0712\x19\x0f\xb2\t\x02\xf4\xc3\xa7\xba\xeeM\x9e\x9f<\xa46s\x93\xf1\r\x8b\xfd\x7fl\x9e\xe2E/\xcfwo\x1eNo\xee^\x1e\xceo\x7f\xfa\xf3\xf9\xe9\xf9\xe3\x9b\x9f\xee_\xce\x9f^\x9e\xdf=\x9d\xef?>\xbe<\xdf\x8d\xff\xba\xfe\xc3\xe9\xe5\xf3\xd3\xc3\xf4\xc3\xbf\x8c\x7f{xy\xf9\xeb\xc3\x87\x87\xc7\x97\xd3\xd3\xf3\xbb\xfb\x87\xf3\xe3\xc3\xcb\xe9\xfe\xed\xdd\xe3\x8f\x0f\xe7\x87\x7f<\xbd{\xbe{y\xf7\xf1qb\xff\xf1\x0f\xeaV\xdfvmk\xce\xf7\xdf~;\xff\xf0\xed\xb7\xd3\xa7\xff~\xf9\xfd\xe6\xe9\xeb\x97\x7f\xfd\xe9\xf4\xc3\xd3\xe9\x97\xef\xff9]\x10\xeaV-\x7f\xec\xdd\xe3\xf9\x87\xf3\xb9W\x8d\xf6\xe7\x1b\xd3\xf4n\xfc\x99\x9e\x7fH\xd3\xba\x90f\x1ak\xce7\xbaQ\xe3\x8f:_\x06\xd31ldu\xe3_tq\xc3z\x91\xd5\xdfvC\x19\xcb\x84,\xdd\xb8\x11\xa6\x9a\xce\x8c?+m\x99\ri\xf6\xc2\xb9i\xc7\xa6\xd9[\xdd\x96\xc1\\\x003vn\xda\xf8\x83\xd2\xa7\xf4\x12\xca\x17?\xe2\x10u\xd8\xe5\xf9\xc6\xa7\x1c\x8a\x1fP\xb5
I can see the file name at the beginning of the string that is returned, but I'm not sure how to actually extract the content. According to the requests documentation, it should automatically be decompressing gzip files?
http://requests.readthedocs.org/en/latest/community/faq/
The request headers look like they include gzip as well:
{'Accept': '*/*', 'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'python-requests/2.7.0 CPython/2.7.10 Windows/7'}
Any suggestions would be much appreciated.

Sometimes web clients request that the server compress a file before sending it. Not .gz files, mind you, since you wouldn't compress something twice. This cuts down the file size, especially for large text files. The client then decompresses it automatically before displaying it to the user. This is what the requests docs in your question describe. You do not have to worry about this for your use case.
To decompress a gzipped file, you have to either decompress it in memory using the gzip module (part of the standard library) or write it to disk in 'wb' mode and decompress it with the gzip utility.
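For example, a minimal sketch in Python 2.7 (matching the question's tag), reusing the variables from the question and decompressing the downloaded bytes in memory:

import gzip
import StringIO  # io.BytesIO on Python 3
import requests

tmpReq = requests.get(tmpHtml, proxies=proxy_w_auth, auth=(usr, pw))
# tmpReq.content holds the raw gzip bytes (note the \x1f\x8b magic number
# at the start of the string in the question)
buf = StringIO.StringIO(tmpReq.content)
with gzip.GzipFile(fileobj=buf) as gz:
    text = gz.read()  # the decompressed contents of someData.dat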

Related

Downloading Gzip'ed content with libcurl without deflating

I want to retrieve webpages in their gzipped format, but instead of decompressing them, I'd like to write the gzipped bytes to a file stream. Is this possible? If I don't send the correct Accept-Encoding header for gzip, I get the normal HTML content, but if I do send it, then it appears libcurl automatically decompresses the response. Am I able to retrieve the original compressed bytes without the overhead of decompressing?
If you pass in Accept-Encoding: gzip as a custom header with CURLOPT_HTTPHEADER, libcurl won't decompress it automatically. It will only automatically decompress gzip if you set CURLOPT_ACCEPT_ENCODING.
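A minimal sketch of the same idea using pycurl (assuming pycurl is installed; the option names mirror libcurl's, and the URL is a placeholder):

import pycurl

with open('page.gz', 'wb') as f:
    c = pycurl.Curl()
    c.setopt(pycurl.URL, 'https://example.com/page')
    # Passed as a custom header, so libcurl will NOT decompress the body;
    # the raw gzip bytes are written to the file as-is.
    c.setopt(pycurl.HTTPHEADER, ['Accept-Encoding: gzip'])
    c.setopt(pycurl.WRITEDATA, f)
    c.perform()
    c.close()

Setting libcurl's CURLOPT_ACCEPT_ENCODING option instead would make it decompress the response automatically.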

Chunk download with OneDrive Rest API

This is the first time I write on Stack Overflow. My question is the following.
I am trying to write a OneDrive C++ API based on the cpprest SDK (Casablanca) project:
https://casablanca.codeplex.com/
In particular, I am currently implementing read operations on OneDrive files.
Actually, I have been able to download a whole file with the following code:
http_client api(U("https://apis.live.net/v5.0/"), m_http_config);
api.request(methods::GET, file_id + L"/content").then([=](http_response response) {
    return response.body();
}).then([=](istream is) {
    streambuf<uint8_t> rwbuf = file_buffer<uint8_t>::open(L"test.txt").get();
    is.read_to_end(rwbuf).get();
    rwbuf.close();
}).wait();
This code basically downloads the whole file to the computer (file_id is the id of the file I am trying to download). Of course, I can extract an input stream from the file and use it to read the file.
However, this could give me issues if the file is big. What I had in mind was to download a part of the file while the caller reads it (and to cache that part in case the caller comes back).
Then, my question would be:
Is it possible, using the OneDrive REST API + cpprest, to download a part of a file stored on OneDrive? I have found that uploading files in chunks is apparently not possible (Chunked upload (resumable upload) for OneDrive?). Is this true for the download as well?
Thank you in advance for your time.
Best regards,
Giuseppe
OneDrive supports byte-range reads, so you should be able to request chunks of whatever size you want by adding a Range header.
For example,
GET /v5.0/<fileid>/content
Range: bytes=0-1023
This will fetch the first KB of the file.
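A minimal sketch of the same request in Python with requests, just to show the header (file_id and access_token are placeholders standing in for the question's auth flow):

import requests

url = 'https://apis.live.net/v5.0/%s/content' % file_id
resp = requests.get(url, headers={'Range': 'bytes=0-1023',
                                  'Authorization': 'Bearer %s' % access_token})
print resp.status_code  # 206 Partial Content if the range was honored
chunk = resp.content    # the first 1 KB of the file

The same Range header can be added to the cpprest http_request before it is sent.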

How to get the content-type of a file

I am implementing an HTTP/1.0 server that processes GET and HEAD requests.
I've finished Date, Last-Modified, and Content-Length, but I don't know how to get the Content-Type of a file.
It has to return directory for a directory (which I can detect using the stat() function); for a regular file, it has to return text/html for a text or HTML file and image/gif for an image or GIF file.
Should this be hard-coded, using the name of the file?
I wonder if there is any function to get this Content-Type.
You could either look at the file extension (which is what most web servers do -- see e.g. the /etc/mime.types file), or you could use libmagic to automatically determine the content type by looking at the first few bytes of the file.
It depends how sophisticated you want to be.
If the files in question are all properly named and there are only a few types to handle, a switch based on the file suffix is sufficient. Going to the extreme case, making the right decision no matter what the file is would probably require either duplicating the functionality of the Unix file command or running it on the file in question (and then translating the output to the proper Content-Type).
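Both approaches have ready-made counterparts if a sketch helps; for example, Python's standard mimetypes module does the extension lookup the same way /etc/mime.types does:

import mimetypes

# Extension-based guess, as most web servers do:
print mimetypes.guess_type('photo.gif')[0]   # image/gif
print mimetypes.guess_type('index.html')[0]  # text/html

For the content-sniffing approach, the third-party python-magic package wraps libmagic.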

When will NSURLConnection decompress a compressed resource?

I've read how NSURLConnection will automatically decompress a compressed (zipped) resource, but I cannot find Apple documentation or any official word that specifies when this decompression occurs. I'm also curious to know how this relates to streamed data.
The Problem
I have a server that streams files to my app using chunked encoding, I believe. It is a WCF service. Incidentally, we're going with streaming because it should alleviate server load during high use and because our files are going to be very large (hundreds of MB). The files may be compressed or uncompressed. I think that because we're streaming the data, the Content-Encoding header is not available, nor is Content-Length. I only see "Transfer-Encoding" = Identity in my response.
I am using the AFNetworking library to write these files to disk with AFHTTPRequestOperation's inputStream and outputStream. I have also tried using AFDownloadRequestOperation as well with similar results.
Now, the AFNetworking docs state that compressed files will automatically be decompressed (via NSURLConnection, I believe) after download, but this is not happening. I write them to my documents directory with no problems, yet they are still zipped. I can unzip them manually as well, so the files are not corrupted. Do they not auto-unzip because I'm streaming the data and because Content-Encoding is not specified?
What I'd like to know:
Why are my compressed files not decompressing automatically? Is it because of streaming? I know I could use another library to decompress afterward, but I'd like to avoid that if possible.
When exactly does NSURLConnection know to decompress a downloaded file automatically? I can't find this anywhere in the docs. Is it tied to a header value?
Any help would be greatly appreciated.
NSURLConnection will decompress automatically when the appropriate Content-Encoding (e.g. gzip) is available in the response header. That's down to your server to arrange.
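A quick way to check what your server actually sends is to inspect that header directly; a sketch with Python's requests against a placeholder URL:

import requests

resp = requests.get('https://example.com/streamed-file', stream=True)
# Auto-decompression (in NSURLConnection and requests alike) only kicks in
# when the server advertises the encoding:
print resp.headers.get('Content-Encoding')  # e.g. 'gzip', or None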

Any reasons not to gzip documents delivered via HTTP?

I remember someone telling me that gzipped content is not cached by some browsers. Is this true?
Are there any other reasons why I shouldn't gzip my content (pages, javascript and css files) with htaccess?
The other reason is that it obviously increases CPU load, but whether this is a problem depends on your content type and your traffic.
If you are going to use gzip from within .htaccess, be sure to wrap it in a condition so that it only executes if the mod_gzip module exists; this will make the site/app more portable if you move it to another server.
If you opt to serve gzipped content via .htaccess, the browser will receive the compressed content if it supports it, or the normal uncompressed version if it doesn't.
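For example, a minimal .htaccess sketch along those lines, using mod_deflate (the Apache 2.x counterpart of the mod_gzip module mentioned above):

<IfModule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/css application/javascript
</IfModule>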
If you are delivering mostly .gz files, then obviously you don't want to gzip them again. Otherwise it's probably a good idea, especially for cacheable content. I have never heard of caches not working with gzipped content.
I think you need to handle both gzipped and non-gzipped data, since IE6 and gzipping do not live together nicely. Otherwise I can't think of an issue.
If you need to stream the content of a page, or want to use Response.Flush, then you can't use compression/gzip.