python-requests hanging when downloading a mass amount of files - python-2.7

I am trying to use python-request package to download a mass amount of files(like 10k+) from the web, each file size from several k to the largest as 100mb.
my script can run through fine for maybe 3000 files but suddenly it will hang.
I ctrl-c it and see it stuck at
r = requests.get(url, headers=headers, stream=True)
File "/Library/Python/2.7/site-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 559, in send
r = adapter.send(request, **kwargs)
File "/Library/Python/2.7/site-packages/requests/adapters.py", line 327, in send
timeout=timeout
File "/Library/Python/2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 493, in urlopen
body=body, headers=headers)
File "/Library/Python/2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 319, in _make_request
httplib_response = conn.getresponse(buffering=True)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
Here is my python code to do the download
basedir = os.path.dirname(filepath)
if not os.path.exists(basedir):
os.makedirs(basedir)
r = requests.get(url, headers=headers, stream=True)
with open(filepath, 'w') as f:
for chunk in r.iter_content(1024):
if chunk:
f.write(chunk)
f.flush()
I am not sure what went wrong, if anyone has a clue, please kindly share some insights.
Thanks.

This is not a duplicate of the question that #alfasin linked in their comment. Judging by the (limited) traceback you posted, the request itself is hanging (the first line shows it was executing r = requests.get(url, headers=headers, stream=True)).
What you should do is set a timeout and catch the exception that is raised when the request times out. Once you have the URL try it in a browser or with curl to ensure it responds properly, otherwise remove it from your list of URLs to request. If you find the misbehaving URL, please update your question with it.

I faced a similar situation and it seems like a bug in the requests package was causing this issue. Upgrading to requests package 2.10.0 fixed it for me.
For your reference the change log for Requests 2.10.0 shows that the embedded urllib3 was updated to version 1.15.1 Release history
And the release history for urllib3 (Release history ) shows that version 1.15.1 included fixes for:
Chunked transfer encoding when requesting with chunked=True. (Issue #790)
Fixed AppEngine handling of transfer-encoding header and bug in Timeout defaults checking. (Issue #763)

Related

Using tools.run_flow() raises SSLHandshake "certificate verify" error in Google Sheets API tutorial

I'm am pretty much following the Google Sheets getting started (in Python) to a tee. I've gotten the program to work on my Mac laptop, but it is failing as I am trying to run it on Windows. So far, I've checked that we do not have the firewall enabled on the machine.
Below is the error that appears after clicking through the authentication prompts that pop up in the browser.
Traceback (most recent call last):
File "Authenticate.py", line 47, in <module>
main()
File "Authenticate.py", line 43, in main
tools.run_flow(flow, store)
File "C:\johnsnow\packages\test\lib\site-packages\oauth2client\_helpers.py", line 133, in positional_wrapper
return wrapped(*args, **kwargs)
File "C:\johnsnow\packages\test\lib\site-packages\oauth2client\tools.py", line 243, in run_flow
credential = flow.step2_exchange(code, http=http)
File "C:\johnsnow\packages\test\lib\site-packages\oauth2client\_helpers.py", line 133, in positional_wrapper
return wrapped(*args, **kwargs)
File "C:\johnsnow\packages\test\lib\site-packages\oauth2client\client.py", line 2054, in step2_exchange
http, self.token_uri, method='POST', body=body, headers=headers)
File "C:\johnsnow\packages\test\lib\site-packages\oauth2client\transport.py", line 282, in request
connection_type=connection_type)
File "C:\johnsnow\packages\test\lib\site-packages\httplib2\__init__.py", line 1570, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "C:\johnsnow\packages\test\lib\site-packages\httplib2\__init__.py", line 1317, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "C:\johnsnow\packages\test\lib\site-packages\httplib2\__init__.py", line 1252, in _conn_request
conn.connect()
File "C:\johnsnow\packages\test\lib\site-packages\httplib2\__init__.py", line 1044, in connect
raise SSLHandshakeError(e)
httplib2.SSLHandshakeError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:661)
I'm not exactly sure how it happened as I was working in a virtualenv, but there was a dependency issue in the version of httplib2 and oauth2. After an uninstall and then reinstall, there was an error that the two libraries were incompatible.
Doing pip freeze showed I had httplib2==0.8 but oauth2client required httplib2=>0.9.
This was resolved by doing pip install --upgrade httplib2.

Selenium doesn't open the browser in Python

I'm new in Python and I'm trying to use Selenium in Debian but it doesn't work, more concretely it seems to stay in a loop and nothing happens. The next script is the test that I've used:
#!/usr/bin/env python
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://www.python.org')
When I interrupt the script the following text appears:
Traceback (most recent call last):
File "prueba_parseo.py", line 7, in browser =
webdriver.Firefox() File
"/usr/local/lib/python2.7/dist-packages/selenium/webdriver/firefox/webdriver.py",
line 154, in init
keep_alive=True)
File
"/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py",
line 140, in init
self.start_session(desired_capabilities, browser_profile)
File
"/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py",
line 229, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File
"/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py",
line 295, in execute
response = self.command_executor.execute(driver_command, params)
File
"/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py",
line 464, in execute
return self._request(command_info[0], url, body=data)
File
"/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py",
line 488, in _request
resp = self._conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1111, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 444, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 400, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
KeyboardInterrupt
I've been searching for an answer, but nothing works. I've changed the versions of the packages, export no_proxy="localhost,127.0.0.1"
OS: Debian 5
Python: 2.7
Selenium: 3.5
Geckodriver: 0.17.0
Firefox: 52.0
I don't know what else to do or what to change.
Thank you very much!
My guess is that it actually all goes well and the browser is started in background. The reason why it's staying open is probably because of the default option keep_alive=True I can see in your traceback.
Try closing the browser using browser.close() or browser.quit() when you are done with the tests.
From the documentation:
Finally, the browser window is closed. You can also call quit method
instead of close. The quit will exit entire browser whereas close`
will close one tab, but if just one tab was open, by default most
browser will exit entirely.
http://selenium-python.readthedocs.io/getting-started.html#simple-usage

Python request to API keeps returning ZeroReturnError exception

Python 2.7.3
Calling an API from a Raspberry Pi 3, the API logs show it hits the correct endpoint and returns with a 200 status code, but the python code from the Pi spits out a huge error stack. I saw in some forums that the ZeroReturnError is always thrown meaning that there was nothing wrong, but that seems weird since I can't actually get the results of the response in an except block from the try.
My code is literally
import requests
response = requests.get(<URL I AM USING>, json={JSON I AM USING})
Not sure what to do.
Traceback (most recent call last):
File "music.py", line 13, in <module>
response = requests.get(url, json={'blah':{'blah':'*********'}})
File "/usr/lib/python2.7/dist-packages/requests/api.py", line 60, in get
return request('get', url, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/api.py", line 49, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 457, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 606, in send
r.content
File "/usr/lib/python2.7/dist-packages/requests/models.py", line 724, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "/usr/lib/python2.7/dist-packages/requests/models.py", line 653, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/usr/lib/python2.7/dist-packages/urllib3/response.py", line 256, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/usr/lib/python2.7/dist-packages/urllib3/response.py", line 186, in read
data = self._fp.read(amt)
File "/usr/lib/python2.7/httplib.py", line 602, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/lib/python2.7/dist-packages/urllib3/contrib/pyopenssl.py", line 188, in recv
data = self.connection.recv(*args, **kwargs)
OpenSSL.SSL.ZeroReturnError
Some more searching brought me to think it was version issues.
Ran sudo pip install urllib3 --upgrade on the Raspberry Pi and it cleared it up.
I am getting a DependencyWarning about installing PySocks, but its working correctly now.

Using Python to Manipulate HTTP Headers

So I'm trying to use Python to automate 508 compliance checking. There're a few hundred pages on our site, and at the moment a person is actually going through the site every week and tries to enter all the URLs by hand. The UIUC link below checks the request for the referer header and then returns the evaluation of that site. I can't get the request to actually work. I've looked all through SO and can't find anything that helps. The code that is screwy is below and below that the error message.
def fae(urltofae):
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
#[('Referer': urltofae)]
r = opener.open('http://www.fae.cita.uiuc.edu/evaluate/link/')
print r
fae("http://www.example.com/")
And the Error:
File "<stdin>", line 1, in <module>
File "<stdin>", line 4, in fae
File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
And when I attempt to try to change the referer header (instead of the User-agent) I get formatting errors instead of it even getting to the request even though the format is identical to the one that it didn't complain about for the user-agent.
I'm still very much a new programmer so if I'm missing something blatant then I'm terribly sorry, but I have tried everything I can think of.
Thanks in advance, cheers.
OK so I switched my strategy, and it worked. Unfortunately, I have no idea why the below code worked, and the stuff above kept erroring me, but I have seen a couple of similar-ish questions (no specific answers) around the google so I figured I should post it.
vlz, appreciate the help, cheers.
def faeRequest2(urltofae):
r = urllib2.Request('http://fae.cita.illinois.edu/evaluate/link/', headers={'User-agent':'Mozilla/5.0', 'Referer':urltofae})
c = urllib2.urlopen(r)
print c.read()
I do not see any errory there. Is the url correct? Try using
'http://fae.cita.uiuc.edu/evaluate/link/'
instead of
'http://www.fae.cita.uiuc.edu/evaluate/link/'
The latter seems not to lead anywhere.

Problem exporting an xls file for user download with django and HttpResponse

I'm currently creating a spreadsheet using xlwt and trying to export it out as an HttpResponse in django for a user to download. My code looks like this:
response = HttpResponse(mimetype = "application/vnd.ms-excel")
response['Content-Disposition'] = 'attachment; filename = %s +".xls"' % u'Zinnia_Entries'
work_book.save(response)
return response
Which seems to be the right way to do it, but I'm getting a:
Traceback (most recent call last):
File "C:\dev\workspace-warranty\imcom\imcom\wsgiserver.py", line 1233, in communicate
req.respond()
File "C:\dev\workspace-warranty\imcom\imcom\wsgiserver.py", line 745, in respond
self.server.gateway(self).respond()
File "C:\dev\workspace-warranty\imcom\imcom\wsgiserver.py", line 1927, in respond
response = self.req.server.wsgi_app(self.env, self.start_response)
File "C:\dev\workspace-warranty\3rdparty\django\core\servers\basehttp.py", line 674, in __call__
return self.application(environ, start_response)
File "C:\dev\workspace-warranty\3rdparty\django\core\handlers\wsgi.py", line 252, in __call__
response = middleware_method(request, response)
File "C:\dev\workspace-warranty\imcom\imcom\seo_mod\middleware.py", line 33, in process_response
response.content = strip_spaces_between_tags(response.content.strip())
File "C:\dev\workspace-warranty\3rdparty\django\utils\functional.py", line 259, in wrapper
return func(*args, **kwargs)
File "C:\dev\workspace-warranty\3rdparty\django\utils\html.py", line 89, in strip_spaces_between_tags
return re.sub(r'>\s+<', '><', force_unicode(value))
File "C:\dev\workspace-warranty\3rdparty\django\utils\encoding.py", line 88, in force_unicode
raise DjangoUnicodeDecodeError(s, *e.args)
DjangoUnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 0: invalid continuation byte. You passed in
(I left off the rest because I get a really long line of this \xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00 kind of stuff)
Do you guys have any ideas on something that could be wrong with this? Is is it because some of my write values look like this:
work_sheet.write(r,#,information) where information isn't cast to unicode?
response['Content-Disposition'] = 'attachment; filename = %s +".xls"' % u'Zinnia_Entries'
should just be
response['Content-Disposition'] = 'attachment; filename = %s.xls' % u'Zinnia_Entries'
without quotes around .xls otherwise the output will be
u'attachment; filename = Zinnia_Entries +".xls"'
So try changing that.
But also check out this answer. It has a really helpful little function for outputing xls files.
django excel xlwt
Solved the problem. Apparently someone had put some funky middleware stuff in that was hacking apart and appending and adding, ect. ect. to the file. When it shouldn't.
Anyway, with it gone the file exports perfectly.
#Storm - Thank you for all the help!