Why does extraction of gzipped sitemap fail in Scrapy 0.24.4? - python-2.7

I am getting a "Not a gzipped file" exception while retrieving a gzipped sitemap XML (tested on amazon.de).
According to the bug trackers, there used to be a bug regarding "Not a gzipped file".
I am using Python 2.7.3 and Scrapy 0.24.4.
Can anyone confirm this as a bug, or am I overlooking something?
UPDATE
I think this is some valuable information, already posted on GitHub as well.
Possible bug:
Retrieving a gzipped sitemap XML (tested on amazon.de) fails.
Reproduce with:
Modify the gunzip method in /utils/gz.py to write the incoming data to a file.
Gunzip that file on the command line.
The unzipped file contains garbled content.
Gunzip the garbled file a second time and you get the correct content.
I suspect that the content coming from the target server is already gzip-compressed and that Scrapy has a bug that stops the HTTP gzip decompression from working properly, so a double-compressed file arrives at the gunzip method in /utils/gz.py.
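The double-compression hypothesis is easy to check in isolation. A minimal sketch (the `gunzip` helper below is a stand-in for Scrapy's `scrapy.utils.gz.gunzip`, not its actual code): gzip a payload twice, observe that one decompression pass still yields data starting with the gzip magic bytes, and that a second pass recovers the original content.

```python
import gzip
import io

def gunzip(data):
    # Decompress one layer of gzip, roughly what scrapy's
    # utils/gz.py gunzip helper does.
    return gzip.GzipFile(fileobj=io.BytesIO(data)).read()

payload = b"<urlset>sitemap entries</urlset>"
# Simulate the suspected bug: the body arrives gzipped twice.
double = gzip.compress(gzip.compress(payload))

once = gunzip(double)
print(once[:2] == b"\x1f\x8b")   # True: still gzipped after one pass
print(gunzip(once) == payload)   # True: second pass recovers the sitemap
```

If Scrapy's HTTP-compression middleware had already stripped one layer, the sitemap parser's single gunzip pass would succeed; seeing gzip magic bytes after that pass is consistent with the description above.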

Related

How to download and get video information at the same time in youtube-dl?

In the youtube-dl CLI, how can I get information about the video (as JSON output) while the video is being downloaded by the app?
When I use this command:
youtube-dl https://www.youtube.com/watch?v=jNQXAC9IVRw
it only shows me the output filename. But what about the duration, resolution, etc.?
I currently do this with two requests, but is it possible in one go?
It would be great if it dumped the video metadata into a JSON file as well as the output filename, because I am also struggling to programmatically get the path of the downloaded file (I have to use a regex).
Just add --print-json in your command line.
youtube-dl --print-json https://www.youtube.com/watch?v=jNQXAC9IVRw
This outputs a big JSON object while the video is still downloading.
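From Python this can be wired up with subprocess, which also removes the need for the regex on the filename. A sketch (the fetch helper assumes youtube-dl is installed and on the PATH; `_filename` and `duration` are keys youtube-dl includes in the JSON it prints):

```python
import json
import subprocess

def parse_info(raw_json):
    """Pull the downloaded path and duration out of the JSON
    that youtube-dl --print-json emits on stdout."""
    info = json.loads(raw_json)
    return info.get("_filename"), info.get("duration")

def fetch_info(url):
    """Download the video and return (path, duration) in one go.
    Sketch: assumes youtube-dl is on the PATH."""
    raw = subprocess.check_output(["youtube-dl", "--print-json", url])
    return parse_info(raw)
```

The `_filename` field gives the exact path youtube-dl wrote to, so no regex over the console output is needed.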

CSV Download - Chrome issue

On one of my routes, a CSV file is downloaded to the user with an incremental filename each time it is executed.
I've hit some kind of caching issue in Chrome: subsequent executions download the original file, but Safari on a Mac works fine every time.
I've googled it and added cache_timeout=0, but that didn't resolve my issue. I don't want to generate the file in an IO stream as some others have suggested elsewhere. Any easy fixes, or shall I redirect to a URL with part of the filename embedded in it?
return send_file(path, as_attachment=True, cache_timeout=0)
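One thing worth trying before changing the URL scheme is to set explicit no-cache headers on the response, since Chrome is stricter than Safari about reusing cached downloads. A framework-agnostic sketch (the helper name is made up; in the Flask route you would apply these headers to the response returned by send_file):

```python
def no_cache_headers():
    """Headers that tell the browser not to reuse a previously
    downloaded response for this route."""
    return {
        "Cache-Control": "no-store, no-cache, must-revalidate, max-age=0",
        "Pragma": "no-cache",   # for older HTTP/1.0 caches
        "Expires": "0",
    }

# Illustrative use inside the Flask route:
# resp = send_file(path, as_attachment=True, cache_timeout=0)
# resp.headers.update(no_cache_headers())
# return resp
```

cache_timeout=0 only influences the max-age Flask computes; adding no-store rules out the cached copy entirely.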

What format is the Dump File in SoapUI?

I make a request to a web service to download a file as an MTOM/XOP attachment; the file is an Excel file (.xlsx).
The response in the SoapUI tool comes back 200 OK with a typical SOAP envelope; the attachment is there in the SoapUI grid, and I can export it from the grid to a file, where it verifies OK (it is my original Excel file).
The real question is about the dump file that got created: it is a garbled binary file, and I have no idea what its contents are, what format it is in, or whether it includes both the SOAP XML response and the attachment; more importantly, how can I decode it to be useful?
Got the answer; rather than deleting the question, I'll leave it up here in case anyone else struggles with this as I did!
In the Raw response in SoapUI, we can see "Content-Encoding: gzip"; this depends on the configuration of the web service / web server.
So after decoding the dump file with GZipStream (I used C#), I got an intelligible format, whereupon I can see the original Excel file embedded in there!
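The same decode works without C# using Python's standard library. A sketch, assuming the dump file is a single gzip stream (the function name is illustrative):

```python
import gzip

def decode_dump(dump_path, out_path):
    """Gunzip a SoapUI dump file whose HTTP response was sent
    with Content-Encoding: gzip, writing the decoded bytes out."""
    with gzip.open(dump_path, "rb") as f:
        data = f.read()
    with open(out_path, "wb") as out:
        out.write(data)
    return data
```

The decoded output is the raw MIME multipart body, so the SOAP XML part and the binary attachment part can then be split on the MIME boundary from the Content-Type header.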

cPickle.load() doesn't accept non-.gz files; what can I use for .pkl files?

I am trying to run an example of an LSTM recurrent neural network presented in this Git repository: https://github.com/mesnilgr/is13.
I've installed Theano and everything, and when I got to the point of running the code, I noticed the data was not being downloaded, so I opened an issue on GitHub (https://github.com/mesnilgr/is13/issues/12) and someone came up with a solution that consisted of:
1- getting the data from the Dropbox link he provides;
2- changing the code of the 'load.py' file to download and read the data properly.
The only issue is that the data in the Dropbox folder (https://www.dropbox.com/s/3lxl9jsbw0j7h8a/atis.pkl?dl=0) is not a compressed .gz file as, I suppose, the data from the original repository was. I don't have enough skill to change the code to do with the uncompressed data exactly what it would do with the compressed one. Can someone help me?
The modifications suggested and the changes I've made are described in the issue I opened on GitHub (https://github.com/mesnilgr/is13/issues/12).
It looks like your code is using
gzip.open(...)
But if the file is not gzipped then you probably just need to remove the gzip. prefix and use
open(...)
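A loader can also handle both cases transparently by sniffing the gzip magic bytes, so load.py works whichever form of the data is present. A sketch (written with Python 3's pickle; under the repo's Python 2 you would substitute cPickle, and the helper names are made up):

```python
import gzip
import pickle

def open_maybe_gzipped(path, mode="rb"):
    """Open with gzip if the file starts with the gzip magic
    bytes (0x1f 0x8b), otherwise as a plain file."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, mode)
    return open(path, mode)

def load_pickle(path):
    """Load a pickle from either a .pkl or a .pkl.gz file."""
    with open_maybe_gzipped(path) as f:
        return pickle.load(f)
```

With this, the same call works for the uncompressed atis.pkl from Dropbox and for a gzipped file from the original repository.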

When will NSURLConnection decompress a compressed resource?

I've read that NSURLConnection will automatically decompress a compressed (gzipped) resource, but I cannot find Apple documentation or official word anywhere that specifies the logic defining when this decompression occurs. I'm also curious how this relates to streamed data.
The Problem
I have a server that streams files to my app using a chunked encoding, I believe. This is a WCF service. Incidentally, we're going with streaming because it should alleviate server load during high use and also because our files are going to be very large (100's of MB). The files could be compressed or uncompressed. I think in my case because we're streaming the data, the Content-Encoding header is not available, nor is Content-Length. I only see "Transfer-Encoding" = Identity in my response.
I am using the AFNetworking library to write these files to disk with AFHTTPRequestOperation's inputStream and outputStream. I have also tried using AFDownloadRequestOperation as well with similar results.
Now, the AFNetworking docs state that compressed files will automatically be decompressed (via NSURLConnection, I believe) after download, but this is not happening. I write them to my documents directory with no problems, yet they are still zipped. I can unzip them manually as well, so the file is not corrupted. Do they not auto-unzip because I'm streaming the data and because Content-Encoding is not specified?
What I'd like to know:
Why are my compressed files not decompressing automatically? Is it because of streaming? I know I could use another library to decompress afterward, but I'd like to avoid that if possible.
When exactly does NSURLConnection know when to decompress a downloaded file, automatically? I can't find this in the docs anywhere. Is this tied to a header value?
Any help would be greatly appreciated.
NSURLConnection will decompress automatically when the appropriate Content-Encoding (e.g. gzip) is available in the response header. That's down to your server to arrange.
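To make the rule concrete, a minimal sketch of the header-driven behavior the answer describes, written in Python rather than Objective-C (the function is illustrative, not NSURLConnection's code): the body is decompressed only when the response advertises Content-Encoding: gzip, which matches why the streamed responses above, which lack that header, arrive still zipped.

```python
import gzip

def decode_body(headers, body):
    """Mirror of the rule: gunzip the body only when the response
    carries Content-Encoding: gzip; otherwise pass it through."""
    encoding = headers.get("Content-Encoding", "").lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    return body  # e.g. Transfer-Encoding: identity, no auto-decompress
```

So the fix is server-side: have WCF set Content-Encoding: gzip on the compressed responses, and the client-side decompression becomes automatic.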