Python 2.7 - Having trouble downloading large files

I'm trying to download some decently large files in python 2.7 (between 300 and 700 MB each), and I'm running into the problem of the connection getting reset in the middle of retrieving the files. Specifically, I was using urllib.urlretrieve(url, file_name), and every so often I get socket.error: [Errno 104] Connection reset by peer.
Now, I'm very unfamiliar with how sockets and web protocols work, so I tried the following, not really knowing if it would help:
import urllib

response = urllib.urlopen(url)
CHUNK_SIZE = 16 * 1024
with open(file_name, 'wb') as f:
    for chunk in iter(lambda: response.read(CHUNK_SIZE), ''):
        f.write(chunk)
Edit: Guess I should credit the author of this code: https://stackoverflow.com/a/1517728/3002473
It sounds reasonable that we're only downloading a little bit at a time, so it should be "less susceptible" to this Errno 104, but again I know basically nothing about how all of this works so I don't know if this actually makes a difference.
After testing a bit it seems like it works slightly better? But that might just be coincidence. Generally, I'm able to download one, maybe two files before this error gets thrown.
Why am I getting Errno 104, and how can I go about preventing this? Out of curiosity, should I be using urllib2 instead of urllib?
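One workaround I'm considering (a rough sketch, assuming the server honors Range requests): catch the reset and resume from however much of the file is already on disk, using urllib2:
import errno
import os
import socket
import urllib2

CHUNK_SIZE = 16 * 1024

def download(url, file_name):
    while True:
        # resume from however much we already have on disk
        offset = os.path.getsize(file_name) if os.path.exists(file_name) else 0
        request = urllib2.Request(url, headers={'Range': 'bytes=%d-' % offset})
        try:
            response = urllib2.urlopen(request)
            with open(file_name, 'ab') as f:
                for chunk in iter(lambda: response.read(CHUNK_SIZE), ''):
                    f.write(chunk)
            return  # finished without the connection being reset
        except socket.error as e:
            if e.errno != errno.ECONNRESET:
                raise  # some other problem; don't loop on it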

Related

Poloniex & websockets

===SIMPLE & SHORT===
Does anybody have a working application that talks to Poloniex through WAMP these days (January 2018)?
===MORE SPECIFIC===
I used several sources of information to try to make it work with the combination of autobahn-cpp and C++ on Windows 10.
I was able to connect to wss://api.poloniex.com, realm1. I was also able to subscribe and get a subscription ID. But I never got any events, even after everything was established.
===RESEARCH===
During my research on the web I saw a lot of conflicting information:
1. Claims that wss://api2.poloniex.com should be used and that channel names are actually numbers - How to connect to poloniex.com websocket api using a python library
2. This answer gave me base code, but I am not getting anything more than just a connection, even by following this answer, which says wss://api.poloniex.com is the correct address - Connecting to Poloniex Push-API
3. I saw a post (sorry, lost the link) with comments saying that the websockets implementation is basically broken on Poloniex. They were posted 6 months ago.
===SPECS===
1. Windows 10
2. Autobahn-Cpp
3. wss://api.poloniex.com:443 ; realm1
4. Different subscriptions: ticker, BTC_ETH, 148, 1002, etc.
5. Source code I got from here
===WILL HELP AS WELL===
Is there any way to get all valid subscriptions, or perhaps those that have more than 0 subscribers? I mean, does WAMP have a way to do that?
Are there any known issues with the Autobahn-Cpp and Poloniex combo?
Is there any simpler way to test WAMP elsewhere to make sure Autobahn isn't the problem? For example, any other well documented and supported online project that accepts WAMP websocket communication?
I can receive the correct tick/order book data from wss://api2.poloniex.com using Python 3, but sometimes channel 1002 may stop sending new tick info.
wss://api.poloniex.com:443 ; realm1
This may be the issue, as I've been using api2. Here is the code that works, and has been working for the past 2 quarters non-stop. It's in Python, but should be easy enough to port to C++.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import websocket
import json

def on_error(ws, error):
    print(error)

def on_close(ws):
    print("### closed ###")

def on_open(ws):
    print("ONOPEN")
    ws.send(json.dumps({'command': 'subscribe', 'channel': 'BTC_ETH'}))

def on_message(ws, message):
    message = json.loads(message)
    print(message)

websocket.enableTrace(True)
ws = websocket.WebSocketApp("wss://api2.poloniex.com/",
                            on_message=on_message,
                            on_error=on_error,
                            on_close=on_close)
ws.on_open = on_open
ws.run_forever()
The code is pretty much self-explanatory (you can check all channels/pairs on the Poloniex API website); just save it and run it in a terminal:
python3 fileName.py
It should provide you with the raw BTC_ETH stream of orders and trades on console output.
Playing with the messages/subscriptions, you can then do with it as you please.
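If you want the ticker instead (channel 1002 from your list), a small, untested variation of the callbacks above would be something like this (the channel number comes from Poloniex's API docs; everything else is the same script):
def on_open(ws):
    # subscribe to the ticker channel instead of a single pair
    ws.send(json.dumps({'command': 'subscribe', 'channel': 1002}))

def on_message(ws, message):
    message = json.loads(message)
    if message[0] == 1002:  # the first element of each frame is the channel id
        print(message)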
It seems that websockets on Poloniex are unstable. Therefore I can stop my attempts to make Autobahn-Cpp work with it, at least for now, and move on.

Simple libtorrent Python client

I tried creating a simple libtorrent Python client (for a magnet URI), and I failed: the program never continues past "downloading metadata".
If you could help me write a simple client it would be amazing.
P.S. When I choose a save path, is the save path the folder I want my data to be saved in, or the path for the data itself?
(I used code someone posted here.)
import libtorrent as lt
import time

ses = lt.session()
ses.listen_on(6881, 6891)

params = {
    'save_path': '/home/downloads/',
    'storage_mode': lt.storage_mode_t(2),
    'paused': False,
    'auto_managed': True,
    'duplicate_is_error': True}

link = "magnet:?xt=urn:btih:4MR6HU7SIHXAXQQFXFJTNLTYSREDR5EI&tr=http://tracker.vodo.net:6970/announce"
handle = lt.add_magnet_uri(ses, link, params)
ses.start_dht()

print 'downloading metadata...'
while not handle.has_metadata():
    time.sleep(1)
print 'got metadata, starting torrent download...'

while handle.status().state != lt.torrent_status.seeding:
    s = handle.status()
    state_str = ['queued', 'checking', 'downloading metadata',
                 'downloading', 'finished', 'seeding', 'allocating']
    print '%.2f%% complete (down: %.1f kB/s up: %.1f kB/s peers: %d) %s %.3f MB' % \
        (s.progress * 100, s.download_rate / 1000, s.upload_rate / 1000,
         s.num_peers, state_str[s.state], s.total_download / 1000000)
    time.sleep(5)
What happens is that the first while loop becomes infinite because the state never changes.
You have to add s = handle.status() inside the first loop; once the metadata arrives, the status changes and the loop stops. Alternatively, move the first while inside the other while so that the same thing happens.
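A minimal sketch of that first loop with the status call added (same variables as the code above; the peer count is just something to watch while waiting):
while not handle.has_metadata():
    s = handle.status()  # refresh the status on every pass
    print 'downloading metadata... (peers: %d)' % s.num_peers
    time.sleep(1)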
Yes, the save path you specify is the one that the torrents will be downloaded to.
As for the metadata downloading part, I would add the following extensions first:
ses.add_extension(lt.create_metadata_plugin)
ses.add_extension(lt.create_ut_metadata_plugin)
Second, I would add a DHT bootstrap node:
ses.add_dht_router("router.bittorrent.com", 6881)
Finally, I would begin debugging the application by seeing if my network interface is binding or if any other errors come up (my experience with BitTorrent download problems, in general, is that they are network related). To get an idea of what's happening I would use libtorrent-rasterbar's alert system:
ses.set_alert_mask(lt.alert.category_t.all_categories)
And make a thread (with the following code) to collect the alerts and display them:
while True:
    ses.wait_for_alert(500)
    alert = ses.pop_alert()
    if not alert:
        continue
    print "[%s] %s" % (type(alert), alert.__str__())
Even with all this working correctly, make sure that the torrent you are trying to download actually has peers. Even if there are a few peers, none may be configured correctly or support metadata exchange (exchanging metadata is not a standard BitTorrent feature). Try to load a torrent file (which doesn't require downloading metadata) and see if you can download successfully (to rule out some network issues).
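For that last check, a rough sketch of loading a local .torrent file instead of a magnet link (the file name and save path are placeholders, and the add_torrent call may vary slightly between libtorrent versions):
import libtorrent as lt
import time

ses = lt.session()
ses.listen_on(6881, 6891)

info = lt.torrent_info('/path/to/some.torrent')  # placeholder path
handle = ses.add_torrent({'ti': info, 'save_path': '/home/downloads/'})

while not handle.status().is_seeding:
    s = handle.status()
    print '%.2f%% complete (down: %.1f kB/s peers: %d)' % \
        (s.progress * 100, s.download_rate / 1000, s.num_peers)
    time.sleep(5)
print 'download finished'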

IPv6 destination options header

I'm working on a software-defined networking research project, and what I need is to make a simple UDP server that puts a data tag into the destination options field (IPv6) of the UDP packet. I was expecting to use either the sendmsg()/recvmsg() calls, or setsockopt() and getsockopt(). However, Python 2.7 doesn't have sendmsg() or recvmsg(), and while I can get setsockopt() to correctly load a tag into the packet (I see it in Wireshark), the getsockopt() call just returns a zero, even if the header is there.
#Python 2.7 client
#This code does put the dest opts header onto the packet correctly
#dst_header is a packed binary string (construction details irrelevant--
# it appears correctly formatted and parsed in Wireshark)
import socket

addr = ("::", 5000, 0, 0)
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_DSTOPTS, dst_header)
s.sendto('This is my message ', addr)

#Python 2.7 server
import socket

MAX = 65535  # receive buffer size (not defined in the original snippet)

addr = ("::", 5000, 0, 0)
s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_RECVDSTOPTS, 1)
s.bind(addr)
data, remote_address = s.recvfrom(MAX)
header_data = s.getsockopt(socket.IPPROTO_IPV6, socket.IPPROTO_DSTOPTS, 1024)
I also tried this in Python 3.4, which does have sendmsg() and recvmsg(), but I just get an error message of "OSError: [Errno 22]: Invalid argument", even though I'm passing it (apparently) correct types:
s.sendmsg(["This is my message"], (socket.IPPROTO_IPV6, socket.IPV6_DSTOPTS, dst_header), 0, addr) #dst_header is same string as for 2.7 version
It looks like 99% of the usage of sendmsg() and recvmsg() is for passing UNIX file descriptors, which isn't what I want to do. Anybody got any ideas? I thought this would be just a four or five line nothing-special program, but I'm stumped.
OK, I'm going to partially answer my own question here, on the off chance that a search engine will bring somebody here with the same issues as I had.
I got the Python 3.4 code working. The problem was not the header, it was the message body. Specifically, both the message body and the header options value fields must be bytes (or bytearray) objects, stored in an iterable container (here, a list). By passing it ["This is my message"] I was sending in a string, not a bytes object; Python let it go, but the OS couldn't cope with that.
You might say I was "byted" by the changes in the handling of strings in Python 3.X...
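For reference, a minimal sketch of the working Python 3.4 call (dst_header construction is still omitted, and socket.IPV6_DSTOPTS assumes a platform such as Linux that exposes that constant):
import socket

addr = ("::1", 5000, 0, 0)
dst_header = b'...'  # packed destination-options header, built as before (details omitted)

s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
# Both the message body and the ancillary-data payload are bytes objects,
# and each is wrapped in a list.
s.sendmsg([b"This is my message"],
          [(socket.IPPROTO_IPV6, socket.IPV6_DSTOPTS, dst_header)],
          0,
          addr)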

Django Interrupted system call when sending email

Sometimes, when submitting a form (pretty much any form on my site that sends me an email), I get the following error:
File "/usr/lib/python2.5/smtplib.py", line 603, in starttls
(resp, reply) = self.docmd("STARTTLS")
File "/usr/lib/python2.5/smtplib.py", line 378, in docmd
return self.getreply()
File "/usr/lib/python2.5/smtplib.py", line 352, in getreply
line = self.file.readline()
File "/usr/lib/python2.5/socket.py", line 381, in readline
data = self._sock.recv(self._rbufsize)
error: (4, 'Interrupted system call')
My code is sending email via gmail. I am also using django contact-form which does the same thing.
The problem doesn't always happen. It seems very random. At one point today it got so bad that it displayed the error every time I submitted a form.
Restarting apache fixes the problem for one submission and then it does it again.
I have checked the RAM and there is plenty available (about 350MB available).
Can someone lead me in the right direction? What does this error mean? What can I do to prevent this?
Thanks.
I would say it has to do with a bad network connection to the SMTP server.
Looks like it gets interrupted while trying to read the reply from the server?
As a workaround, you may want to try increasing the socket timeout.
As for fixing this, you may not have a stable connection to GMail's server and there may not be a way around this.
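For example, setting a (longer) default socket timeout before the mail is sent; in Python 2.5, smtplib picks this up because it applies to every socket created afterwards (just a sketch of the workaround, not a guaranteed fix):
import socket
socket.setdefaulttimeout(60)  # seconds; affects sockets smtplib creates after this point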
It looks as if EINTR is being raised before the recv call gets any data back.
The recv call used by smtplib is being interrupted by a signal before any data was read. Per the read(2) manpage, POSIX allows a read() that is interrupted after reading some data to return -1 (with errno set to EINTR) or to return the number of bytes already read.
In Python, EINTR raises an IOError: "[Errno 4] Interrupted system call" (EINTR == 4).
An example of how EINTR is properly handled is subprocess.communicate(). See an excellent post here:
http://znasibov.info/blog/post/inside-python-subprocess-communication.html
However, in Python 2.5, socket.readline() does not properly handle EINTR. See:
http://bugs.python.org/issue1628205: socket.readline() interface doesn't handle EINTR properly
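Until you can move to a Python version where this is fixed, a retry wrapper is one way around it. A rough sketch (the helper name and retry count are my own, not from the bug report):
import errno
import socket
from django.core.mail import send_mail

def send_mail_retry_eintr(subject, body, from_addr, recipients, retries=3):
    # Retry the whole send if the SMTP socket read is interrupted by a signal.
    for attempt in range(retries):
        try:
            return send_mail(subject, body, from_addr, recipients)
        except socket.error, e:
            # In Python 2.5 the error tuple is (errno, message); only retry on EINTR.
            if e.args[0] != errno.EINTR or attempt == retries - 1:
                raise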

Apache lags when responding to gzipped requests

For an application I'm developing, the user submits a gzipped HTTP POST request (content-encoding: GZIP) with multipart form data (content-type: multipart/form-data). I use mod_deflate as an input filter to decompress and the web request is processed in Django via mod_wsgi.
Generally, everything is fine. But for certain requests (deterministically), there is almost a minute of lag between request and response. Investigation shows that the processing in Django completes immediately, but the response from the server stalls. If the request is not gzipped, everything works fine.
Note that, to deal with a glitch in mod_wsgi, I set Content-Length to the uncompressed message size.
Has anyone run into this problem? Is there a way to easily debug apache as it processes responses?
What glitch do you believe exists in mod_wsgi?
The simple fact of the matter is that WSGI 1.0 doesn't support mutating input filters which change the content length of the request content. Thus technically you can't use mod_deflate in Apache for request content when using WSGI 1.0. Your setting the content length to be a value other than the actual size is most likely going to stuff up operation of mod_deflate.
If you want to be able to handle compressed request content you need to step outside of WSGI 1.0 specification and use non standard code.
I suggest you have a read of:
http://blog.dscpl.com.au/2009/10/details-on-wsgi-10-amendmentsclarificat.html
It explains this problem and the suggestions around it.
I'd very much suggest you take this issue over to the official mod_wsgi mailing list for discussion about how you need to write your code. If you are using one of the Python frameworks, however, you are probably going to be restricted in what you can do, as they implement WSGI 1.0, where you can't do this.
UPDATE 1
Following discussion on the mod_wsgi list, the original WSGI application should be wrapped in the following WSGI middleware. This will only work on WSGI adapters that actually provide an empty string as the end sentinel for input, something WSGI 1.0 doesn't require. It should probably only be used for small uploads, since everything is read into memory. If you need to handle large compressed uploads, the accumulated data should be written out to a file instead.
import cStringIO

class Wrapper(object):
    def __init__(self, application):
        self.__application = application

    def __call__(self, environ, start_response):
        if environ.get('HTTP_CONTENT_ENCODING', '') == 'gzip':
            buffer = cStringIO.StringIO()
            input = environ['wsgi.input']
            blksize = 8192
            length = 0
            data = input.read(blksize)
            buffer.write(data)
            length += len(data)
            while data:
                data = input.read(blksize)
                buffer.write(data)
                length += len(data)
            buffer = cStringIO.StringIO(buffer.getvalue())
            environ['wsgi.input'] = buffer
            environ['CONTENT_LENGTH'] = str(length)  # WSGI expects CONTENT_LENGTH as a string
        return self.__application(environ, start_response)

application = Wrapper(original_wsgi_application_callable)
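If you do need large compressed uploads, a hypothetical variant (my own sketch, not from the mailing list) spools the body to a temporary file instead of keeping it in memory:
import tempfile

class FileSpoolWrapper(object):
    def __init__(self, application):
        self.__application = application

    def __call__(self, environ, start_response):
        if environ.get('HTTP_CONTENT_ENCODING', '') == 'gzip':
            spool = tempfile.TemporaryFile()
            input = environ['wsgi.input']
            length = 0
            data = input.read(8192)
            while data:
                spool.write(data)
                length += len(data)
                data = input.read(8192)
            spool.seek(0)  # rewind so the application can read from the start
            environ['wsgi.input'] = spool
            environ['CONTENT_LENGTH'] = str(length)
        return self.__application(environ, start_response)

application = FileSpoolWrapper(original_wsgi_application_callable)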