Python urllib2 does not respect timeout - python-2.7

The following two lines of code hang forever:
import urllib2
urllib2.urlopen('https://www.5giay.vn/', timeout=5)
This is with python2.7, and I have no http_proxy or any other env variables set. Any other website works fine. I can also wget the site without any issue. What could be the issue?

If you run
import urllib2
url = 'https://www.5giay.vn/'
urllib2.urlopen(url, timeout=1.0)
wait for a few seconds, and then use C-c to interrupt the program, you'll see
File "/usr/lib/python2.7/ssl.py", line 260, in read
return self._sslobj.read(len)
KeyboardInterrupt
This shows that the program is hanging on self._sslobj.read(len).
SSL timeouts raise socket.timeout.
You can control the delay before socket.timeout is raised by calling
socket.setdefaulttimeout(1.0).
For example,
import urllib2
import socket
socket.setdefaulttimeout(1.0)
url = 'https://www.5giay.vn/'
try:
    urllib2.urlopen(url, timeout=1.0)
except IOError as err:
    print('timeout')
% time script.py
timeout
real 0m3.629s
user 0m0.020s
sys 0m0.024s
Note that the requests module succeeds here although urllib2 did not:
import requests
r = requests.get('https://www.5giay.vn/')
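(requests also accepts a timeout argument if you want to bound the wait, e.g. r = requests.get('https://www.5giay.vn/', timeout=5.0), which raises requests.exceptions.Timeout when exceeded.)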
How to enforce a timeout on the entire function call:
socket.setdefaulttimeout only affects how long Python waits before an exception is raised if the server has not issued a response.
Neither it nor urlopen(..., timeout=...) enforce a time limit on the entire function call.
To do that, you could use eventlet, as shown here.
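A minimal sketch of that approach, assuming the eventlet package is installed (the exception raised is eventlet's own Timeout, not socket.timeout):
import eventlet
eventlet.monkey_patch()  # make the stdlib sockets cooperative
import urllib2

url = 'https://www.5giay.vn/'
try:
    with eventlet.Timeout(5):  # bounds the whole block, not just one read
        urllib2.urlopen(url).read()
except eventlet.Timeout:
    print('timeout')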
If you don't want to install eventlet, you could use multiprocessing from the standard library, though this solution will not scale as well as an asynchronous solution such as the one eventlet provides.
import urllib2
import socket
import multiprocessing as mp

def timeout(t, cmd, *args, **kwds):
    pool = mp.Pool(processes=1)
    result = pool.apply_async(cmd, args=args, kwds=kwds)
    try:
        retval = result.get(timeout=t)
    except mp.TimeoutError as err:
        pool.terminate()
        pool.join()
        raise
    else:
        return retval

def open(url):
    response = urllib2.urlopen(url)
    print(response)

url = 'https://www.5giay.vn/'
try:
    timeout(5, open, url)
except mp.TimeoutError as err:
    print('timeout')
Running this will either succeed or time out within about 5 seconds of wall clock time.
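Two caveats worth noting with the multiprocessing approach: the function and arguments passed to apply_async must be picklable, and on platforms that spawn rather than fork worker processes (e.g. Windows) the script needs an if __name__ == '__main__' guard.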

Related

Scheduled Tasks - Runs without Error but does not produce any output - Django PythonAnywhere

I have setup a scheduled task to run daily on PythonAnywhere.
The task uses the Django Commands as I found this was the preferred method to use with PythonAnywhere.
The task produces no errors, but I don't get any output.
2022-06-16 22:56:13 -- Completed task, took 9.13 seconds, return code was 0.
I have tried using print() to debug areas of the code, but I cannot produce any output in either the error or server logs, even after trying print(date_today, file=sys.stderr).
I have set the path on the Scheduled Task as follows (not sure if this is correct, but it seems to be the only way I can get it to run without errors):
workon advancementvenv && python3.8 /home/vicinstofsport/advancement_series/manage.py shell < /home/vicinstofsport/advancement_series/advancement/management/commands/schedule_task.py
I have tried setting the path as below, but then I get an error when I try to import from the models.py file (I know this is related to a relative import but cannot manage to resolve it):
Traceback (most recent call last):
  File "/home/vicinstofsport/advancement_series/advancement/management/commands/schedule_task.py", line 3, in <module>
    from advancement.models import Bookings
ModuleNotFoundError: No module named 'advancement'
2022-06-17 03:41:22 -- Completed task, took 14.76 seconds, return code was 1.
Any ideas on how I can get this working? It all works fine locally if I use the command py manage.py scheduled_task; it just fails on PythonAnywhere.
Below is the task code and structure of the app.
from django.core.management.base import BaseCommand
import requests
from advancement.models import Bookings
from datetime import datetime, timedelta, date
import datetime
from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail
from django.core.mail import send_mail
import os
from decouple import config

class Command(BaseCommand):
    help = 'Sends Program Survey'

    def handle(self, *args, **kwargs):
        # Get today's date
        date_today = datetime.datetime.now().date()
        # Get booking data
        bookings = Bookings.objects.all()
        # For each booking today, send survey email
        for booking in bookings:
            if booking.booking_date == date_today:
                if booking.program_type == "Sport Science":
                    booking_template_id = 'd-bbc79704a31a4a62a5bfea90f6342b7a'
                    email = booking.email
                    booking_message = Mail(from_email=config('FROM_EMAIL'),
                                           to_emails=[email],
                                           )
                    booking_message.template_id = booking_template_id
                    try:
                        sg = SendGridAPIClient(config('SG_API'))
                        response = sg.send(booking_message)
                    except Exception as e:
                        print(e)
                else:
                    booking_template_id = 'd-3167927b3e2146519ff6d9035ab59256'
                    email = booking.email
                    booking_message = Mail(from_email=config('FROM_EMAIL'),
                                           to_emails=[email],
                                           )
                    booking_message.template_id = booking_template_id
                    try:
                        sg = SendGridAPIClient(config('SG_API'))
                        response = sg.send(booking_message)
                    except Exception as e:
                        print(e)
            else:
                print('No')
Thanks in advance for any help.
Thanks Filip and Glenn: testing within the bash console and changing the directory in the task helped to fix the issue. Adding cd /home/vicinstofsport/advancement_series && to my task allowed the function to run.
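For reference, the final task line would then look something like this (the cd prefix is the fix; the rest is the original command):
cd /home/vicinstofsport/advancement_series && workon advancementvenv && python3.8 /home/vicinstofsport/advancement_series/manage.py shell < /home/vicinstofsport/advancement_series/advancement/management/commands/schedule_task.py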

concurrent celery tasks execute and store results but .get not working

I have written a Celery Task class like this:
myapp.tasks.py
from __future__ import absolute_import, unicode_literals
from .services.celery import app
from .services.command_service import CommandService
from exceptions.exceptions import *
from .models import Command

class CustomTask(app.Task):
    def run(self, json_string, method_name, cmd_id: int):
        command_obj = Command.objects.get(id=cmd_id)  # type: Command
        try:
            val = eval('CommandService.{}(json_string={})'.format(method_name, json_string))
            status, error = 200, None
        except Exception as e:
            auto_retry = command_obj.auto_retry
            if auto_retry and isinstance(e, CustomError):
                command_obj.retry_count += 1
                command_obj.save()
                return self.retry(countdown=CustomTask._backoff(command_obj.retry_count), exc=e)
            elif auto_retry and isinstance(e, AnotherCustomError) and command_obj.retry_count == 0:
                command_obj.retry_count += 1
                command_obj.save()
                print("RETRYING NOW FOR DEVICE CONNECTION ERROR. TRANSACTION: {} || IP: {}".format(command_obj.transaction_id,
                                                                                                   command_obj.device_ip))
                return self.retry(countdown=command_obj.retry_count*2, exc=e)
            val = None
            status, error = self._find_status_code(e)
        return_dict = {"error": error, "status_code": status, "result": val}
        return return_dict

    @staticmethod
    def _backoff(attempts):
        return 2 ** attempts

    @staticmethod
    def _find_status_code(exception):
        if isinstance(exception, APIException):
            detail = exception.default_detail if exception.detail is None else exception.detail
            return exception.status_code, detail
        return 500, CustomTask._get_generic_exc_msg(exception)

    @staticmethod
    def _get_generic_exc_msg(exc: Exception):
        s = ""
        try:
            for msg in exc.args:
                s += msg + ". "
        except Exception:
            s = str(exc)
        return s
CustomTask = app.register_task(CustomTask())
The Celery App definition:
from __future__ import absolute_import, unicode_literals
import os
from celery import Celery, Task
from django.conf import settings
# set the default Django settings module for the 'celery' program.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myapp.settings')
_celery_broker = settings.CELERY_BROKER  # my broker is amqp://username:password@localhost:5672/myhost
app = Celery('myapp', broker=_celery_broker, backend='rpc://', include=['myapp.tasks', 'myapp.controllers'])
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks(['myapp'])
app.conf.update(
    result_expires=4800,
    task_acks_late=True
)
my __init__.py, as the tutorial recommended:
from .celery import app as celery_app
__all__ = ['celery_app']
The controller that is running the task:
from __future__ import absolute_import, unicode_literals
from .services.log_service import LogRunner
from myapp.services.command_service import CommandService
from exceptions.exceptions import *
from myapp.services.celery import app
from myapp.services.tasks import MyTask
from .models import Command

class MyController:
    def my_method(self, json_string):
        # <non-async set up stuff here>
        cmd_obj = Command.objects.create(<stuff>)  # type: Command
        task_exec = MyTask.delay(json_string, MyController._method_name, cmd_obj.id)
        cmd_obj.task_id = task_exec
        try:
            return_dict = task_exec.get()
        except Exception as e:
            self._logger.error("ERROR: IP: {} and transaction: {}. Error Type: {}, "
                               "Celery Error: {}".format(ip_addr, transaction_id, type(e), e))
            status_code, error = self._find_status_code(e)
            return_dict = {"error": error, "status_code": status_code, "result": None}
        return return_dict
So here is my issue:
When I run this Django controller by hitting the view with one request, one after the other, it works perfectly fine.
However, the external service I am hitting will throw an error for 2 concurrent requests (and that is expected - that is ok). Upon getting the error, I retry my task automatically.
Here is the weird part
Upon retry, the .get() I have in my controller stops working for all concurrent requests. My controller just hangs there! And I know that celery is actually executing the task! Here are the logs from the celery run:
[2018-09-25 19:10:24,932: INFO/MainProcess] Received task: myapp.tasks.MyTask[bafd62b6-7e29-4c39-86ff-fe903d864c4f]
[2018-09-25 19:10:25,710: INFO/MainProcess] Received task: myapp.tasks.MyTask[8d3b4279-0b7e-48cf-b45d-0f1f89e213d4] <-- THIS WILL FAIL BUT THAT IS OK
[2018-09-25 19:10:25,794: ERROR/ForkPoolWorker-1] Could not connect to device with IP <some ip> at all. Retry Later plase
[2018-09-25 19:10:25,798: WARNING/ForkPoolWorker-1] RETRYING NOW FOR DEVICE CONNECTION ERROR. TRANSACTION: b_txn || IP: <some ip>
[2018-09-25 19:10:25,821: INFO/MainProcess] Received task: myapp.tasks.MyTask[8d3b4279-0b7e-48cf-b45d-0f1f89e213d4] ETA:[2018-09-25 19:10:27.799473+00:00]
[2018-09-25 19:10:25,823: INFO/ForkPoolWorker-1] Task myapp.tasks.MyTask[8d3b4279-0b7e-48cf-b45d-0f1f89e213d4] retry: Retry in 2s: AnotherCustomError('Could not connect to IP <some ip> at all.',)
[2018-09-25 19:10:27,400: INFO/ForkPoolWorker-2] executed command some command at IP <some ip>
[2018-09-25 19:10:27,418: INFO/ForkPoolWorker-2] Task myapp.tasks.MyTask[bafd62b6-7e29-4c39-86ff-fe903d864c4f] succeeded in 2.4829552830196917s: {'error': None, 'status_code': 200, 'result': True}
<some command output here from a successful run> <-- belongs to task bafd62b6-7e29-4c39-86ff-fe903d864c4f
[2018-09-25 19:10:31,058: INFO/ForkPoolWorker-2] executed some command at IP <some ip>
[2018-09-25 19:10:31,059: INFO/ForkPoolWorker-2] Task command_runner.tasks.MyTask[8d3b4279-0b7e-48cf-b45d-0f1f89e213d4] succeeded in 2.404364461021032s: {'error': None, 'status_code': 200, 'result': True}
<some command output here from a successful run> <-- belongs to task 8d3b4279-0b7e-48cf-b45d-0f1f89e213d4, which errored and retried itself
So as you can see, the task does run on celery! It's just that the .get() I have in my controller is unable to pick these results back up, whether the task succeeded or errored.
Oftentimes, the error I get when running concurrent requests is: "Received 0x50 while expecting 0xce". What is that? What does it mean? Again, weirdly enough, all this works when doing one request after another, without Django handling multiple incoming requests. Although, I haven't been able to retry for single requests.
The RPC backend (which is what get is waiting for) is designed to fail if it is used more than once or after a celery restart.
a result can only be retrieved once, and only by the client that initiated the task. Two different processes can’t wait for the same result.
The messages are transient (non-persistent) by default, so the results will disappear if the broker restarts. You can configure the result backend to send persistent messages using the result_persistent setting.
So what appears to be happening is that the exception causes celery to stop and break its rpc connection with the calling controller. Given your use case, it may make more sense to use a persistent result backend such as redis or a database.
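For example, a minimal sketch of the app definition switched to a redis result backend (assuming a redis server on localhost:6379; the broker and the rest of the configuration stay the same):
app = Celery('myapp', broker=_celery_broker,
             backend='redis://localhost:6379/0',
             include=['myapp.tasks', 'myapp.controllers'])
Unlike rpc://, a redis backend keeps each result until result_expires, and any client process can fetch it, not only the one that queued the task.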

Running a python subprocess in Flask

I have a Flask web app running a crawling process in this fashion:
on terminal tab 1:
$ cd /path/to/scraping
$ scrapyrt
http://scrapyrt.readthedocs.io/en/latest/index.html
on terminal tab 2:
$ python app.py
and in app.py:
params = {
    'spider_name': spider,
    'start_requests': True
}
response = requests.get('http://localhost:9080/crawl.json', params)
print ('RESPONSE',response)
data = json.loads(response.text)
which works.
Now I'd like to move everything into app.py, and for that I've tried:
import subprocess
from time import sleep

try:
    # note: check_output blocks until the command exits
    subprocess.check_output("scrapyrt", shell=True, cwd='path/to/scraping')
except subprocess.CalledProcessError as e:
    raise RuntimeError("command '{}' return with error (code {}): {}".format(e.cmd, e.returncode, e.output))

sleep(3)
params = {
    'spider_name': spider,
    'start_requests': True
}
response = requests.get('http://localhost:9080/crawl.json', params)
print ('RESPONSE',response)
data = json.loads(response.text)
This starts twisted, like so:
2018-02-03 17:29:35-0200 [-] Log opened.
2018-02-03 17:29:35-0200 [-] Site starting on 9080
2018-02-03 17:29:35-0200 [-] Starting factory <twisted.web.server.Site instance at 0x104effa70>
but the crawling process hangs and does not go through.
What am I missing here?
Have you considered using a scheduler, such as APScheduler or similar?
You can run code using crons or in intervals, and it integrates well with Flask.
Take a look here:
http://apscheduler.readthedocs.io/en/stable/userguide.html
Here is an example:
from flask import Flask, render_template
from apscheduler.schedulers.background import BackgroundScheduler

app = Flask(__name__)
cron = BackgroundScheduler()
cron.start()

@app.route('/')
def index():
    return render_template("index.html")

@cron.scheduled_job('interval', minutes=3)
def my_scrapper():
    # Do the stuff you need
    return True
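If you do still want to launch scrapyrt from app.py, one option (a sketch, not tested against your layout) is subprocess.Popen, which returns immediately, instead of check_output, which blocks until the child process exits:
import subprocess
from time import sleep

# Popen does not wait for scrapyrt to finish, so app.py can continue
proc = subprocess.Popen(['scrapyrt'], cwd='path/to/scraping')
sleep(3)  # crude wait for the server to start listening on port 9080
# ... issue requests.get('http://localhost:9080/crawl.json', params) as before ...
# proc.terminate() when the app shuts down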

Can we replace urlopen in this code with requests library?

Can we replace urlopen in this example of concurrent requests with the requests library in Python 2.7?
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
Thanks!
Yes, you can.
Your code seems to do a simple HTTP get with timeout, so the equivalent with requests is:
import requests

def load_url(url, timeout):
    r = requests.get(url, timeout=timeout)
    return r.content
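As with urlopen, the timeout here bounds individual socket operations, not the entire download. If you also want HTTP error statuses (4xx/5xx) to propagate out of future.result() the way urllib's HTTPError does, a possible variant is:
import requests

def load_url(url, timeout):
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return r.content
The rest of the ThreadPoolExecutor code can stay exactly as it is.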

django tornado getting 403 error on start

Following a tutorial I have written a management command to start tornado and it looks like:
import signal
import time
import tornado.httpserver
import tornado.ioloop
from django.core.management.base import BaseCommand, CommandError
from privatemessages.tornadoapp import application

class Command(BaseCommand):
    args = '[port_number]'
    help = 'Starts the Tornado application for message handling.'

    def sig_handler(self, sig, frame):
        """Catch signal and init callback"""
        tornado.ioloop.IOLoop.instance().add_callback(self.shutdown)

    def shutdown(self):
        """Stop server and add callback to stop i/o loop"""
        self.http_server.stop()
        io_loop = tornado.ioloop.IOLoop.instance()
        io_loop.add_timeout(time.time() + 2, io_loop.stop)

    def handle(self, *args, **options):
        if len(args) == 1:
            try:
                port = int(args[0])
            except ValueError:
                raise CommandError('Invalid port number specified')
        else:
            port = 8888
        self.http_server = tornado.httpserver.HTTPServer(application)
        self.http_server.listen(port, address="127.0.0.1")
        # Init signals handler
        signal.signal(signal.SIGTERM, self.sig_handler)
        # This will also catch KeyboardInterrupt exception
        signal.signal(signal.SIGINT, self.sig_handler)
        print "start"
        tornado.ioloop.IOLoop.instance().start()
        print "end"
When I run this management command, I get a tornado.access:403 GET /2/ (127.0.0.1) 1.92ms error.
For test purposes I have printed "start" and "end". I guess when this command is executed successfully, "end" should be printed.
Here only "start" is printed, not "end". I guess there is some error in tornado.ioloop.IOLoop.instance().start(), but I don't know what it is.
Can anyone guide me on what is wrong here?
You forgot to put self.http_server.start() before starting the ioloop.
...
self.http_server = tornado.httpserver.HTTPServer(application)
self.http_server.listen(port, address="127.0.0.1")
self.http_server.start()
...
Update
Along with the httpserver start missing, are you using this library?
Your management command is exactly this one
Their tutorial is in Russian and, honestly, I don't read Russian.
Update 2:
What I see in their code is that the url /(?P<thread_id>\d+)/ is a websocket handler:
application = tornado.web.Application([
    (r"/", MainHandler),
    (r'/(?P<thread_id>\d+)/', MessagesHandler),
])
...
class MessagesHandler(tornado.websocket.WebSocketHandler):
    ...
But the error you posted looks like something tried to access it over plain HTTP via GET.
Honestly, without a debugger and the same environment, I can't figure out the issue.
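One thing that may be worth checking, though: since Tornado 4.0, WebSocketHandler enforces a same-origin check and answers failing upgrade requests with 403, which would match the tornado.access log line. A sketch that relaxes the check for local development only (an assumption, not a confirmed diagnosis):
class MessagesHandler(tornado.websocket.WebSocketHandler):
    def check_origin(self, origin):
        # assumption: the 403 comes from Tornado's default same-origin
        # websocket check; allowing every origin is for development only
        return True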