Scrapy gets 'DNS lookup failed' error when Celery runs with eventlet enabled - Flask

I'm using Scrapy inside Flask, with Celery running it as a background task.
I start Celery as usual:
celery -A scrapy_flask.celery worker -l info
It works fine.
However, I want to use WebSocket in the Scrapy project to send data to the web page, so my code changed in the following three places:
socketio = SocketIO(app) -> socketio = SocketIO(app, message_queue=SOCKETIO_REDIS_URL)
import eventlet
eventlet.monkey_patch()
start Celery with eventlet enabled: celery -A scrapy_flask.celery -P eventlet worker -l info
Then the spider gets this error: Error downloading <GET http://www.XXXXXXX.com/>: DNS lookup failed: address 'www.XXXXXXX.com' not found: timeout error.
Here is my demo code:
# coding=utf-8
import eventlet
eventlet.monkey_patch()

from flask import Flask, render_template
from flask_socketio import SocketIO
from celery import Celery

app = Flask(__name__, template_folder='./')

# Celery configuration
app.config['CELERY_BROKER_URL'] = 'redis://127.0.0.1/0'
app.config['CELERY_RESULT_BACKEND'] = 'redis://127.0.0.1/0'

celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])
celery.conf.update(app.config)

SOCKETIO_REDIS_URL = 'redis://127.0.0.1/0'
socketio = SocketIO(app, message_queue=SOCKETIO_REDIS_URL)

from scrapy.crawler import CrawlerProcess
from TestSpider.start_test_spider import settings
from TestSpider.TestSpider.spiders.UpdateTestSpider import UpdateTestSpider

@celery.task
def background_task():
    process = CrawlerProcess(settings)
    process.crawl(UpdateTestSpider)
    process.start()  # the script will block here until the crawling is finished

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/task')
def start_background_task():
    background_task.delay()
    return 'Started'

if __name__ == '__main__':
    socketio.run(app, host='0.0.0.0', port=9000, debug=True)
And here is the log output:
[2016-11-25 09:33:39,319: ERROR/MainProcess] Error downloading <GET http://www.XXXXX.com>: DNS lookup failed: address 'www.XXXXX.com' not found: timeout error.
[2016-11-25 09:33:39,320: WARNING/MainProcess] 2016-11-25 09:33:39 [scrapy] ERROR: Error downloading <GET http://www.XXXXX.com>: DNS lookup failed: address 'www.XXXXX.com' not found: timeout error.
[2016-11-25 09:33:39,420: INFO/MainProcess] Closing spider (finished)
[2016-11-25 09:33:39,421: WARNING/MainProcess] 2016-11-25 09:33:39 [scrapy] INFO: Closing spider (finished)
[2016-11-25 09:33:39,422: INFO/MainProcess] Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 3,
'downloader/request_bytes': 639,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 11, 25, 1, 33, 39, 421501),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'log_count/WARNING': 15,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 11, 25, 1, 30, 39, 15207)}
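For what it's worth, eventlet.monkey_patch() green-threads the socket and thread modules that Twisted's threaded DNS resolver relies on, which is the likely reason the lookups only time out once the worker runs under the eventlet pool. One workaround sometimes used is to keep eventlet out of the Celery worker entirely: the worker only needs the Redis message_queue to emit to the page, and the crawl can then run under an ordinary prefork pool where DNS resolution behaves normally. A minimal sketch under that assumption follows; the module name tasks.py, the event name 'crawl_status' and its payload are made up for illustration:

# tasks.py -- hypothetical module, deliberately free of eventlet.monkey_patch()
from celery import Celery
from flask_socketio import SocketIO
from scrapy.crawler import CrawlerProcess

from TestSpider.start_test_spider import settings
from TestSpider.TestSpider.spiders.UpdateTestSpider import UpdateTestSpider

SOCKETIO_REDIS_URL = 'redis://127.0.0.1/0'

celery = Celery('tasks', broker='redis://127.0.0.1/0')

@celery.task
def background_task():
    # Emit to the web page through the Redis message queue; an external
    # process only needs message_queue for this, not eventlet.
    sio = SocketIO(message_queue=SOCKETIO_REDIS_URL)
    sio.emit('crawl_status', {'state': 'started'})

    # Plain prefork worker: Twisted's DNS resolver is not affected here.
    process = CrawlerProcess(settings)
    process.crawl(UpdateTestSpider)
    process.start()

The worker would then be started with the default pool, e.g. celery -A tasks worker -l info, the Flask route would import background_task from this module, and the Flask/Socket.IO web process keeps its eventlet setup unchanged.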

Related

How do I test that my Celery worker actually works in Django

(code at bottom)
Context: I'm working on a Django project where I need to give the user feedback on a task that takes 15-45 seconds. In comes Celery to the rescue! I can see that Celery performs as expected when I run celery -A my_project worker -l info & python manage.py runserver.
Problem: I can't figure out how to run a Celery worker in my tests. When I run python manage.py test, I get the following error:
Traceback (most recent call last):
File "/Users/pbrockman/coding/t1v/lib/python3.8/site-packages/django/test/utils.py", line 387, in inner
return func(*args, **kwargs)
File "/Users/pbrockman/coding/tcommerce/tcommerce/tests.py", line 58, in test_shared_celery_task
self.assertEqual(result.get(), 6)
File "/Users/pbrockman/coding/t1v/lib/python3.8/site-packages/celery/result.py", line 224, in get
return self.backend.wait_for_pending(
File "/Users/pbrockman/coding/t1v/lib/python3.8/site-packages/celery/backends/base.py", line 756, in wait_for_pending
meta = self.wait_for(
File "/Users/pbrockman/coding/t1v/lib/python3.8/site-packages/celery/backends/base.py", line 1087, in _is_disabled
raise NotImplementedError(E_NO_BACKEND.strip())
NotImplementedError: No result backend is configured.
Please see the documentation for more information.
Attempted solution:
I tried various combinations of @override_settings with CELERY_TASK_ALWAYS_EAGER=True, CELERY_TASK_EAGER_PROPOGATES=True, and BROKER_BACKEND='memory'.
I tried both the @app.task decorator and the @shared_task decorator.
How do I verify that Celery behaves as expected in my tests?
Code
Celery Settings: my_project/celery.py
import os
from dotenv import load_dotenv
load_dotenv()

from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'my_project.settings')

app = Celery(f'my_project-{os.environ.get("ENVIRONMENT")}',
             broker=os.environ.get('REDISCLOUD_URL'),
             include=['my_project.tasks'])

from django.conf import settings
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)

if __name__ == '__main__':
    app.start()
Testing: my_project/tests.py
from django.test import TestCase

from tcommerce.celery import app
from tcommerce.tasks import shared_add
from tcommerce.tasks import app_add

class CeleryTests(TestCase):

    def test_shared_celery_task(self):
        '@shared_task'
        result = shared_add.delay(2, 4)
        self.assertEqual(result.get(), 6)

    def test_app_celery_task(self):
        '@app.task'
        result = app_add.delay(2, 4)
        self.assertEqual(result.get(), 6)
Defining tasks: my_project/tasks.py
from .celery import app
from celery import shared_task

@shared_task
def shared_add(x, y):
    return x + y

@app.task
def app_add(x, y):
    return x + y
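One detail worth noting: the Celery app above is configured straight from environment variables and never reads Django settings, so @override_settings(CELERY_TASK_ALWAYS_EAGER=True) on its own will not reach it. A minimal sketch of forcing eager execution on the app object directly in a test (task_always_eager and task_eager_propagates are standard Celery config keys; in eager mode .delay() returns an EagerResult, so .get() works without a result backend):

from django.test import TestCase

from tcommerce.celery import app
from tcommerce.tasks import shared_add

class CeleryEagerTests(TestCase):
    def setUp(self):
        # Run tasks inline in the test process instead of sending
        # them to a broker/worker.
        app.conf.task_always_eager = True
        app.conf.task_eager_propagates = True

    def test_shared_add_runs_eagerly(self):
        result = shared_add.delay(2, 4)
        self.assertEqual(result.get(), 6)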

Why does Flask + SocketIO + Gevent give me SSL EOF errors?

This is a simple code snippet that consistently repeats the issue I'm having. I'm using Python 2.7.12, Flask 0.11, Flask-SocketIO 2.7.1, and gevent 1.1.2. I understand that this is probably an issue better brought up to the responsible package's mailing list, but I can't figure out which one is responsible. However, I'm pretty sure it is a problem with gevent because that's what raises the exception.
from flask import Flask
from flask_socketio import SocketIO
from gevent import monkey
monkey.patch_all()
import ssl

app = Flask(__name__)
app.config['SECRET_KEY'] = 'secret'
socketio = SocketIO(app, async_mode='gevent')

@app.route('/')
def index():
    return "Hello World!"

@socketio.on('connect')
def handle_connect_event():
    print('Client connected')

if __name__ == '__main__':
    socketio.run(app, host='127.0.0.1', port=8443,
                 certfile='ssl/server/server.cer', keyfile='ssl/server/server.key',
                 ca_certs='ssl/server/ca.cer', cert_reqs=ssl.CERT_REQUIRED,
                 ssl_version=ssl.PROTOCOL_TLSv1_2)
And here is the error I get when the client connects:
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/gevent/greenlet.py", line 534, in
result = self._run(*self.args, **self.kwargs)
File "/usr/lib/python2.7/site-packages/gevent/baseserver.py", line 25, in
return handle(*args_tuple)
File "/usr/lib/python2.7/site-packages/gevent/server.py", line 126, in wr
ssl_socket = self.wrap_socket(client_socket, **self.ssl_args)
File "/usr/lib/python2.7/site-packages/gevent/_sslgte279.py", line 691, i
ciphers=ciphers)
File "/usr/lib/python2.7/site-packages/gevent/_sslgte279.py", line 271, i
raise x
SSLEOFError: EOF occurred in violation of protocol (_ssl.c:590)
<Greenlet at 0x7fdd593c94b0: _handle_and_close_when_done(<bound method WSGInd method WSGIServer.do_close of <WSGIServer a, (<socket at 0x7fdd590f4410 SSLEOFError
My system also has OpenSSL version 1.0.2j, if that helps. Any thoughts would be appreciated!
Call monkey.patch_all() at the very top of the code, before the Flask and Flask-SocketIO imports, so that the ssl and socket modules they pull in are already patched:
from gevent import monkey
monkey.patch_all()
from flask import Flask
from flask_socketio import SocketIO
import ssl

Raising Error: NotRegistered when I use Flask with Celery

Description
Hi, I'm learning Celery, and I read this blog post:
Celery and the Flask Application Factory Pattern - miguelgrinberg.com
So I wrote a small program to run Flask with Celery.
Code
app.__init__.py
from flask import Flask
from celery import Celery

celery = Celery(__name__, broker='amqp://127.0.0.1:5672/')

def create_app():
    app = Flask(__name__)

    @celery.task
    def add(x, y):
        print x + y

    @app.route('/')
    def index():
        add.delay(1, 3)
        return 'Hello World!'

    return app
manage.py
from app import create_app

app = create_app()

if __name__ == '__main__':
    app.run()
celery_worker_1.py
from app import celery, create_app

f_app = create_app()
f_app.app_context().push()
celery_worker_2.py
from app import celery, create_app

@celery.task
def foo():
    print 'Balabala...'

f_app = create_app()
f_app.app_context().push()
Problem
When I run the Flask server and start Celery using:
celery -A celery_worker_1 worker -l info
Celery raises a NotRegistered error:
Traceback (most recent call last):
  File "D:\Python27\lib\site-packages\billiard\pool.py", line 363, in workloop
    result = (True, prepare_result(fun(*args, **kwargs)))
  File "D:\Python27\lib\site-packages\celery\app\trace.py", line 349, in _fast_trace_task
    return _tasks[task].__trace__(uuid, args, kwargs, request)[0]
  File "D:\Python27\lib\site-packages\celery\app\registry.py", line 26, in __missing__
    raise self.NotRegistered(key)
NotRegistered: 'app.add'
But when I use celery_worker_2 instead:
celery -A celery_worker_2 worker -l info
the task runs correctly:
[2015-11-28 15:45:56,299: INFO/MainProcess] Received task: app.add[cbe5e1d6-c5df-4141-9db1-e6313517c202]
[2015-11-28 15:45:56,302: WARNING/Worker-1] 4
[2015-11-28 15:45:56,371: INFO/MainProcess] Task app.add[cbe5e1d6-c5df-4141-9db1-e6313517c202] succeeded in 0.0699999332428s: None
Why doesn't Celery run correctly with the code in celery_worker_1?
PS: I'm not good at English; if anything is unclear, please point it out and I'll describe it again. Thanks!
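A quick way to see what each worker module actually registers is to print the app's task registry after create_app() has run, or to ask a running worker. A small debugging sketch (the file name and the print line are only for illustration):

# check_registry.py -- hypothetical debugging helper
from app import celery, create_app

f_app = create_app()
f_app.app_context().push()

# celery.tasks is the app's task registry, a dict keyed by task name;
# 'app.add' should appear here if the decorator inside create_app() ran.
print sorted(celery.tasks.keys())

The same information is available from a running worker with celery -A celery_worker_1 inspect registered.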

Scrapy encountered http status <521>

I am new to Scrapy, and I tried to crawl a website page but got HTTP status code <521> back.
Does that mean the server refuses the connection? (I can open the page in a browser.)
I tried setting cookies, but it still returned 521.
Questions:
What is the reason I get a 521 status code?
Is it because of the cookie setting? Is something wrong in my cookie-setting code?
How can I crawl this page?
Thank you very much for your help!
The log:
2015-06-07 08:27:26+0800 [scrapy] INFO: Scrapy 0.24.6 started (bot: ccdi)
2015-06-07 08:27:26+0800 [scrapy] INFO: Optional features available: ssl, http11
2015-06-07 08:27:26+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ccdi.spiders', 'FEED_URI': '412.json', 'SPIDER_MODULES': ['ccdi.spiders'], 'BOT_NAME': 'ccdi', 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3)AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 2}
2015-06-07 08:27:26+0800 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled item pipelines:
2015-06-07 08:27:27+0800 [ccdi] INFO: Spider opened
2015-06-07 08:27:27+0800 [ccdi] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-07 08:27:27+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-07 08:27:27+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-07 08:27:27+0800 [ccdi] DEBUG: Crawled (521) <GET http://www.ccdi.gov.cn/jlsc/index_2.html> (referer: None)
2015-06-07 08:27:27+0800 [ccdi] DEBUG: Ignoring response <521 http://www.ccdi.gov.cn/jlsc/index_2.html>: HTTP status code is not handled or not allowed
2015-06-07 08:27:27+0800 [ccdi] INFO: Closing spider (finished)
2015-06-07 08:27:27+0800 [ccdi] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 537,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 512,
'downloader/response_count': 1,
'downloader/response_status_count/521': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 7, 0, 27, 27, 468000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 6, 7, 0, 27, 27, 359000)}
2015-06-07 08:27:27+0800 [ccdi] INFO: Spider closed (finished)
My original code:
#encoding: utf-8
import sys
import scrapy
import re
from scrapy.selector import Selector
from scrapy.http.request import Request
from ccdi.items import CcdiItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class CcdiSpider(CrawlSpider):
    name = "ccdi"
    allowed_domains = ["ccdi.gov.cn"]
    start_urls = "http://www.ccdi.gov.cn/jlsc/index_2.html"

    #rules = (
    #    Rule(SgmlLinkExtractor(allow=r"/jlsc/+", ),
    #         callback="parse_ccdi", follow=True),
    #)

    def start_requests(self):
        yield Request(self.start_urls, cookies={'NAME': 'Value'}, callback=self.parse_ccdi)

    def parse_ccdi(self, response):
        item = CcdiItem()
        self.get_title(response, item)
        self.get_url(response, item)
        self.get_time(response, item)
        self.get_keyword(response, item)
        self.get_text(response, item)
        return item

    def get_title(self, response, item):
        title = response.xpath("/html/head/title/text()").extract()
        if title:
            item['ccdi_title'] = title

    def get_text(self, response, item):
        ccdi_body = response.xpath("//div[@class='TRS_Editor']/div[@class='TRS_Editor']/p/text()").extract()
        if ccdi_body:
            item['ccdi_body'] = ccdi_body

    def get_time(self, response, item):
        ccdi_time = response.xpath("//em[@class='e e2']/text()").extract()
        if ccdi_time:
            item['ccdi_time'] = ccdi_time[0][5:]

    def get_url(self, response, item):
        ccdi_url = response.url
        if ccdi_url:
            print ccdi_url
            item['ccdi_url'] = ccdi_url

    def get_keyword(self, response, item):
        ccdi_keyword = response.xpath("//html/head/meta[@http-equiv = 'keywords']/@content").extract()
        if ccdi_keyword:
            item['ccdi_keyword'] = ccdi_keyword
The HTTP status code 521 is a custom error code sent by Cloudflare and usually means that the web server is down: https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#521error
In my case the error did not occur anymore after setting a custom USER_AGENT in my settings.py.
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'crawler (+http://example.com)'
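If the 521 keeps coming back, it can also help to let the spider see the response instead of having HttpErrorMiddleware drop it; Scrapy's handle_httpstatus_list spider attribute does exactly that. A short sketch following the spider from the question (the logging line is only illustrative):

class CcdiSpider(CrawlSpider):
    name = "ccdi"
    # Let 521 responses reach the callback instead of being filtered out
    # by HttpErrorMiddleware.
    handle_httpstatus_list = [521]

    def parse_ccdi(self, response):
        if response.status == 521:
            # The body is often an anti-bot / JavaScript challenge page.
            self.log("Got 521, body starts with: %r" % response.body[:200])
            return
        # ... normal parsing as in the original spider ...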

Scrapy – cannot store scraped values in a file

I am trying to crawl the web in order to find blogs with Polish or Poland in their titles. I have some problems at the very beginning: my spider is able to scrape my website's title, but doesn't store it in a file when running
scrapy crawl spider -o test.csv -t csv blogseek
Here are my settings:
spider
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from polishblog.items import PolishblogItem

class BlogseekSpider(CrawlSpider):
    name = 'blogseek'
    start_urls = [
        #'http://www.thepolskiblog.co.uk',
        #'http://blogs.transparent.com/polish',
        #'http://poland.leonkonieczny.com/blog/',
        #'http://www.imaginepoland.blogspot.com'
        'http://www.normalesup.org/~dthiriet'
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = PolishblogItem()
        i['titre'] = sel.xpath('//title/text()').extract()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i
items.py
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class PolishblogItem(Item):
    # define the fields for your item here like:
    titre = Field()
    #description = Field()
    #url = Field()
    #pass
When I run
scrapy parse --spider=blogseek -c parse_item -d 2 'http://www.normalesup.org/~dthiriet'
I get the title scraped correctly. So what am I missing? I'd bet it's something silly, but I couldn't find the issue. Thanks!
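As a side note, the usual form of the export command puts the spider name directly after crawl; assuming the spider is registered as blogseek, the invocation would normally look like:
scrapy crawl blogseek -o test.csv -t csv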
EDIT: maybe there is an issue with the feed export. When I run with this settings.py:
# Scrapy settings for polishblog project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'polishblog'
SPIDER_MODULES = ['polishblog.spiders']
NEWSPIDER_MODULE = 'polishblog.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'damien thiriet (+http://www.normalesup.org/~dthiriet)'
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_DELAY=0.25
ROBOTSTXT_OBEY=True
DEPTH_LIMIT=3
# storage of the results
FEED_EXPORTERS='CsvItemExporter'
FEED_URI='titresblogs.csv'
FEED_FORMAT='csv'
I get an error message:
  File "/usr/lib/python2.7/site-packages/scrapy/contrib/feedexport.py", line 196, in _load_components
    conf.update(self.settings[setting_prefix])
ValueError: dictionary update sequence element #0 has length 1; 2 is required
I installed Scrapy this way:
pip2.7 install Scrapy
Was I wrong? The docs recommend pip install Scrapy, but then I would get Python 3.4 dependencies installed; I bet that is not the point.
EDIT #2:
Here are my logs
2014-06-10 11:00:15+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: polishblog)
2014-06-10 11:00:15+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-06-10 11:00:15+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'polishblog.spiders', 'FEED_URI': 'stdout:', 'DEPTH_LIMIT': 3, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['polishblog.spiders'], 'BOT_NAME': 'polishblog', 'ROBOTSTXT_OBEY': True, 'COOKIES_ENABLED': False, 'USER_AGENT': 'damien thiriet (+http://www.normalesup.org/~dthiriet)', 'LOG_FILE': '/tmp/scrapylog', 'DOWNLOAD_DELAY': 0.25}
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled item pipelines:
2014-06-10 11:00:15+0200 [blogseek] INFO: Spider opened
2014-06-10 11:00:15+0200 [blogseek] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-10 11:00:15+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-10 11:00:15+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-10 11:00:15+0200 [blogseek] DEBUG: Crawled (200) <GET http://www.normalesup.org/robots.txt> (referer: None)
2014-06-10 11:00:15+0200 [blogseek] DEBUG: Redirecting (301) to <GET http://www.normalesup.org/~dthiriet/> from <GET http://www.normalesup.org/~dthiriet>
2014-06-10 11:00:16+0200 [blogseek] DEBUG: Crawled (200) <GET http://www.normalesup.org/~dthiriet/> (referer: None)
2014-06-10 11:00:16+0200 [blogseek] INFO: Closing spider (finished)
2014-06-10 11:00:16+0200 [blogseek] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 737,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 6187,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 6, 10, 9, 0, 16, 166865),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 6, 10, 9, 0, 15, 334634)}
2014-06-10 11:00:16+0200 [blogseek] INFO: Spider closed (finished)
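The ValueError in the first edit appears to come from FEED_EXPORTERS: Scrapy expects that setting to be a dict mapping a format name to an exporter class path, not a bare string, and CSV export is already built in, so it normally does not need to be overridden at all. A sketch of the feed-related part of settings.py under that assumption:

# settings.py (feed export section only) -- sketch
# CSV support is built in, so FEED_EXPORTERS can simply be left out.
FEED_URI = 'titresblogs.csv'
FEED_FORMAT = 'csv'

# If a custom exporter really were needed, FEED_EXPORTERS must be a dict, e.g.:
# FEED_EXPORTERS = {'csv': 'scrapy.contrib.exporter.CsvItemExporter'}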