Scrapy is not entering the parse function - python-2.7

I am running the spider below, but it never enters the parse method and I don't know why. My code is below:
from scrapy.item import Item, Field
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MyItem(Item):
    reviewer_ranking = Field()

print "asdadsa"

class MySpider(BaseSpider):
    name = 'myspider'
    allowed_domains = ["amazon.com"]
    start_urls = ["http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp"]
    print "sadasds"

    def parse(self, response):
        print "fggfggftgtr"
        sel = Selector(response)
        hxs = HtmlXPathSelector(response)
        item = MyItem()
        item["reviewer_ranking"] = hxs.select('//span[@class="a-size-small a-color-secondary"]/text()').extract()
        return item
The output I am getting is below:
$ scrapy runspider crawler_reviewers_data.py
asdadsa
sadasds
/home/raj/Documents/IIM A/Daily sales rank/Daily reviews/Reviews_scripts/Scripts_review/Reviews/Reviewer/crawler_reviewers_data.py:18: ScrapyDeprecationWarning: crawler_reviewers_data.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class MySpider(BaseSpider):
2014-06-24 19:21:35+0530 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
2014-06-24 19:21:35+0530 [scrapy] INFO: Optional features available: ssl, http11
2014-06-24 19:21:35+0530 [scrapy] INFO: Overridden settings: {}
2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled item pipelines:
2014-06-24 19:21:35+0530 [myspider] INFO: Spider opened
2014-06-24 19:21:35+0530 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-24 19:21:35+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6027
2014-06-24 19:21:35+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6084
2014-06-24 19:21:36+0530 [myspider] DEBUG: Crawled (403) <GET http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp> (referer: None) ['partial']
2014-06-24 19:21:36+0530 [myspider] INFO: Closing spider (finished)
2014-06-24 19:21:36+0530 [myspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 259,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 28487,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 6, 24, 13, 51, 36, 631236),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 6, 24, 13, 51, 35, 472849)}
2014-06-24 19:21:36+0530 [myspider] INFO: Spider closed (finished)
Please help me, I am stuck at this point.

This is an anti-web-crawling measure used by Amazon: you are getting 403 Forbidden because it requires a User-Agent header to be sent with the request.
One option would be to manually add it to the Request yielded from start_requests():
from scrapy.http import Request

class MySpider(BaseSpider):
    name = 'myspider'
    allowed_domains = ["amazon.com"]

    def start_requests(self):
        yield Request("https://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp",
                      headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"})

    ...
Another option would be to set the DEFAULT_REQUEST_HEADERS setting project-wide.
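For reference, a minimal settings.py sketch: the USER_AGENT value is just the example browser string from above, and DEFAULT_REQUEST_HEADERS is a plain dict of header names to values (the entries shown mirror Scrapy's defaults).

# settings.py -- example values only
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"

# default headers sent with every request unless the request sets them itself
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}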
Also note that Amazon provides an API which has a python wrapper, consider using it.
Hope that helps.

Related

Scrapy gets 'DNS lookup failed' error with Celery eventlet enabled

I'm using Scrapy in Flask, with Celery running it as a background task.
I start Celery as normal:
celery -A scrapy_flask.celery worker -l info
It works well...
However, I want to use WebSocket in Scrapy to send data to the web page, so my code changed in the following three places:
1. socketio = SocketIO(app) -> socketio = SocketIO(app, message_queue=SOCKETIO_REDIS_URL)
2. added:
import eventlet
eventlet.monkey_patch()
3. start Celery with eventlet enabled: celery -A scrapy_flask.celery -P eventlet worker -l info
Then the spider gets this error: Error downloading <GET http://www.XXXXXXX.com/>: DNS lookup failed: address 'www.XXXXXXX.com' not found: timeout error.
Here is my demo code:
# coding=utf-8
import eventlet
eventlet.monkey_patch()

from flask import Flask, render_template
from flask_socketio import SocketIO
from celery import Celery

app = Flask(__name__, template_folder='./')

# Celery configuration
app.config['CELERY_BROKER_URL'] = 'redis://127.0.0.1/0'
app.config['CELERY_RESULT_BACKEND'] = 'redis://127.0.0.1/0'

celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'])
celery.conf.update(app.config)

SOCKETIO_REDIS_URL = 'redis://127.0.0.1/0'
socketio = SocketIO(app, message_queue=SOCKETIO_REDIS_URL)

from scrapy.crawler import CrawlerProcess
from TestSpider.start_test_spider import settings
from TestSpider.TestSpider.spiders.UpdateTestSpider import UpdateTestSpider

@celery.task
def background_task():
    process = CrawlerProcess(settings)
    process.crawl(UpdateTestSpider)
    process.start()  # the script will block here until the crawling is finished

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/task')
def start_background_task():
    background_task.delay()
    return 'Started'

if __name__ == '__main__':
    socketio.run(app, host='0.0.0.0', port=9000, debug=True)
Here is the logging:
[2016-11-25 09:33:39,319: ERROR/MainProcess] Error downloading <GET http://www.XXXXX.com>: DNS lookup failed: address 'www.XXXXX.com' not found: timeout error.
[2016-11-25 09:33:39,320: WARNING/MainProcess] 2016-11-25 09:33:39 [scrapy] ERROR: Error downloading <GET http://www.XXXXX.com>: DNS lookup failed: address 'www.XXXXX.com' not found: timeout error.
[2016-11-25 09:33:39,420: INFO/MainProcess] Closing spider (finished)
[2016-11-25 09:33:39,421: WARNING/MainProcess] 2016-11-25 09:33:39 [scrapy] INFO: Closing spider (finished)
[2016-11-25 09:33:39,422: INFO/MainProcess] Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 3,
'downloader/request_bytes': 639,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 11, 25, 1, 33, 39, 421501),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'log_count/WARNING': 15,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 11, 25, 1, 30, 39, 15207)}

Python scrapy run with Unhandled error in Deferred

Windows, Python version 2.7.10, Scrapy version 1.0.1.
The same problem also appears when I run scrapy fetch http://google.com:81,
and I don't know how to solve it.
My code:
items.py:
from scrapy.item import Item, Field

class StackItem(Item):
    title = Field()
    url = Field()
stack_spider.py
from scrapy import Spider
from scrapy.selector import Selector

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions?pagesize=50&sort=newest", ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = StackItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
The error detail:
$ scrapy crawl stack
2015-07-07 16:26:26 [scrapy] INFO: Scrapy 1.0.1 started (bot: stack)
2015-07-07 16:26:26 [scrapy] INFO: Optional features available: ssl, http11
2015-07-07 16:26:26 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'stack.spiders', 'SPIDER_MODULES': ['stack.spiders'], 'BOT_NAME': 'stack'}
2015-07-07 16:26:27 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
Unhandled error in Deferred:
2015-07-07 16:26:28 [twisted] CRITICAL: Unhandled error in Deferred:
2015-07-07 16:26:28 [twisted] CRITICAL:
I think you are missing a line in the stack_spider.py file:
from stack.items import StackItem
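With that import added, the top of stack_spider.py would look like this (a sketch, assuming the project package is named stack as in the settings shown above):

from scrapy import Spider
from scrapy.selector import Selector
from stack.items import StackItem  # the missing import

class StackSpider(Spider):
    name = "stack"
    # ... rest of the spider unchanged ...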

Scrapy encountered http status <521>

I am new to Scrapy, and tried to crawl a website page but got HTTP status code <521> back.
Does that mean the server refuses the connection? (I can open the page in a browser.)
I tried using a cookie setting, but it still returned 521.
Questions:
1. What is the reason I got the 521 status code?
2. Is it because of the cookie setting? Is my code wrong about the cookie setting?
3. How can I crawl this page?
Thank you very much for your help!
The log:
2015-06-07 08:27:26+0800 [scrapy] INFO: Scrapy 0.24.6 started (bot: ccdi)
2015-06-07 08:27:26+0800 [scrapy] INFO: Optional features available: ssl, http11
2015-06-07 08:27:26+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ccdi.spiders', 'FEED_URI': '412.json', 'SPIDER_MODULES': ['ccdi.spiders'], 'BOT_NAME': 'ccdi', 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3)AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 2}
2015-06-07 08:27:26+0800 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled item pipelines:
2015-06-07 08:27:27+0800 [ccdi] INFO: Spider opened
2015-06-07 08:27:27+0800 [ccdi] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-07 08:27:27+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-07 08:27:27+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-07 08:27:27+0800 [ccdi] DEBUG: Crawled (521) <GET http://www.ccdi.gov.cn/jlsc/index_2.html> (referer: None)
2015-06-07 08:27:27+0800 [ccdi] DEBUG: Ignoring response <521 http://www.ccdi.gov.cn/jlsc/index_2.html>: HTTP status code is not handled or not allowed
2015-06-07 08:27:27+0800 [ccdi] INFO: Closing spider (finished)
2015-06-07 08:27:27+0800 [ccdi] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 537,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 512,
'downloader/response_count': 1,
'downloader/response_status_count/521': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 7, 0, 27, 27, 468000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 6, 7, 0, 27, 27, 359000)}
2015-06-07 08:27:27+0800 [ccdi] INFO: Spider closed (finished)
My original code:
#encoding: utf-8
import sys
import scrapy
import re
from scrapy.selector import Selector
from scrapy.http.request import Request
from ccdi.items import CcdiItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class CcdiSpider(CrawlSpider):
    name = "ccdi"
    allowed_domains = ["ccdi.gov.cn"]
    start_urls = "http://www.ccdi.gov.cn/jlsc/index_2.html"

    #rules = (
    #    Rule(SgmlLinkExtractor(allow=r"/jlsc/+", ),
    #         callback="parse_ccdi", follow=True),
    #)

    def start_requests(self):
        yield Request(self.start_urls, cookies={'NAME': 'Value'}, callback=self.parse_ccdi)

    def parse_ccdi(self, response):
        item = CcdiItem()
        self.get_title(response, item)
        self.get_url(response, item)
        self.get_time(response, item)
        self.get_keyword(response, item)
        self.get_text(response, item)
        return item

    def get_title(self, response, item):
        title = response.xpath("/html/head/title/text()").extract()
        if title:
            item['ccdi_title'] = title

    def get_text(self, response, item):
        ccdi_body = response.xpath("//div[@class='TRS_Editor']/div[@class='TRS_Editor']/p/text()").extract()
        if ccdi_body:
            item['ccdi_body'] = ccdi_body

    def get_time(self, response, item):
        ccdi_time = response.xpath("//em[@class='e e2']/text()").extract()
        if ccdi_time:
            item['ccdi_time'] = ccdi_time[0][5:]

    def get_url(self, response, item):
        ccdi_url = response.url
        if ccdi_url:
            print ccdi_url
            item['ccdi_url'] = ccdi_url

    def get_keyword(self, response, item):
        ccdi_keyword = response.xpath("//html/head/meta[@http-equiv = 'keywords']/@content").extract()
        if ccdi_keyword:
            item['ccdi_keyword'] = ccdi_keyword
The HTTP status code 521 is a custom error code sent by Cloudflare and usually means that the web server is down: https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#521error
In my case the error no longer occurred after setting a custom USER_AGENT in my settings.py:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'crawler (+http://example.com)'
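If a project-wide setting is not wanted, the same header can also be sent on the individual request, as in the Amazon answer above. A sketch against the question's spider (the User-Agent string is just the example value above, and Request is already imported in that code):

    # inside CcdiSpider: send the User-Agent on the request itself
    def start_requests(self):
        yield Request(self.start_urls,
                      headers={'User-Agent': 'crawler (+http://example.com)'},
                      cookies={'NAME': 'Value'},
                      callback=self.parse_ccdi)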

Scrapy – cannot store scraped values in a file

I am trying to crawl the web to find blogs with Polish or Poland in their titles. I have a problem at the very beginning: my spider is able to scrape my website's title, but doesn't store it in a file when running
scrapy crawl spider -o test.csv -t csv blogseek
Here are my settings:
spider
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from polishblog.items import PolishblogItem

class BlogseekSpider(CrawlSpider):
    name = 'blogseek'
    start_urls = [
        #'http://www.thepolskiblog.co.uk',
        #'http://blogs.transparent.com/polish',
        #'http://poland.leonkonieczny.com/blog/',
        #'http://www.imaginepoland.blogspot.com'
        'http://www.normalesup.org/~dthiriet'
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = PolishblogItem()
        i['titre'] = sel.xpath('//title/text()').extract()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i
items.py
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class PolishblogItem(Item):
    # define the fields for your item here like:
    titre = Field()
    #description = Field()
    #url = Field()
    #pass
When I run
scrapy parse --spider=blogseek -c parse_item -d 2 'http://www.normalesup.org/~dthiriet'
I get the title scraped. So what is the problem? I'd bet it's a silly one, but I couldn't find the issue. Thanks!
EDIT: maybe there is an issue with the feed export. When I run with this settings.py:
# Scrapy settings for polishblog project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'polishblog'
SPIDER_MODULES = ['polishblog.spiders']
NEWSPIDER_MODULE = 'polishblog.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'damien thiriet (+http://www.normalesup.org/~dthiriet)'
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_DELAY=0.25
ROBOTSTXT_OBEY=True
DEPTH_LIMIT=3
# results storage
FEED_EXPORTERS='CsvItemExporter'
FEED_URI='titresblogs.csv'
FEED_FORMAT='csv'
I get an error message
File /usr/lib/python2.7/site-packages/scrapy/contrib/feedexport.py, line 196, in _load_components
conf.update(self.settings[setting_prefix])
ValueError: dictionary update sequence element #0 has length 1; 2 is required
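For reference, FEED_EXPORTERS is a dict-valued setting mapping a format name to an exporter class path, which is why a bare string makes that conf.update() call fail. A minimal sketch of the dict form (the class path shown is the one from the Scrapy 0.22 series used here; csv is already registered by default, so this entry is usually unnecessary):

# FEED_EXPORTERS must map format name -> exporter class path (a dict, not a string)
FEED_EXPORTERS = {
    'csv': 'scrapy.contrib.exporter.CsvItemExporter',
}
FEED_URI = 'titresblogs.csv'
FEED_FORMAT = 'csv'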
I installed Scrapy this way:
pip2.7 install Scrapy
Was I wrong? The docs recommend pip install Scrapy, but then I would have Python 3.4 dependencies installed; I bet this is not the point.
EDIT #2:
Here are my logs
2014-06-10 11:00:15+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: polishblog)
2014-06-10 11:00:15+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-06-10 11:00:15+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'polishblog.spiders', 'FEED_URI': 'stdout:', 'DEPTH_LIMIT': 3, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['polishblog.spiders'], 'BOT_NAME': 'polishblog', 'ROBOTSTXT_OBEY': True, 'COOKIES_ENABLED': False, 'USER_AGENT': 'damien thiriet (+http://www.normalesup.org/~dthiriet)', 'LOG_FILE': '/tmp/scrapylog', 'DOWNLOAD_DELAY': 0.25}
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled item pipelines:
2014-06-10 11:00:15+0200 [blogseek] INFO: Spider opened
2014-06-10 11:00:15+0200 [blogseek] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-10 11:00:15+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-10 11:00:15+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-10 11:00:15+0200 [blogseek] DEBUG: Crawled (200) <GET http://www.normalesup.org/robots.txt> (referer: None)
2014-06-10 11:00:15+0200 [blogseek] DEBUG: Redirecting (301) to <GET http://www.normalesup.org/~dthiriet/> from <GET http://www.normalesup.org/~dthiriet>
2014-06-10 11:00:16+0200 [blogseek] DEBUG: Crawled (200) <GET http://www.normalesup.org/~dthiriet/> (referer: None)
2014-06-10 11:00:16+0200 [blogseek] INFO: Closing spider (finished)
2014-06-10 11:00:16+0200 [blogseek] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 737,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 6187,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 6, 10, 9, 0, 16, 166865),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 6, 10, 9, 0, 15, 334634)}
2014-06-10 11:00:16+0200 [blogseek] INFO: Spider closed (finished)

Scrapy CrawlSpider isn't following the links on a particular page

I have made a spider to crawl a forum that requires a login. I start it off on the login page. The problem occurs with the page that I direct the spider to after the login is successful.
If I open up my rules to accept all links, the spider successfully follows the links on the login page. However, it doesn't follow any of the links on the page that I feed it using Request(). This suggests that the problem is not a broken XPath.
The login appears to work: the page_parse function writes the page source to a text file, and the source is from the page I'm looking for, which can only be reached after logging in. However, the pipeline I have in place to take a screenshot of each page captures the login page, but not the page that I then send the spider on to.
Here is the spider:
class PLMSpider(CrawlSpider):
    name = 'plm'
    allowed_domains = ["patientslikeme.com"]
    start_urls = [
        "https://www.patientslikeme.com/login"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r"patientslikeme.com/login")), callback='login_parse', follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='content-section']")), callback='post_parse', follow=False),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='pagination']")), callback='page_parse', follow=True),
    )

    def __init__(self, **kwargs):
        ScrapyFileLogObserver(open("debug.log", 'w'), level=logging.DEBUG).start()
        CrawlSpider.__init__(self, **kwargs)

    def post_parse(self, response):
        url = response.url
        log.msg("Post parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        return item

    def page_parse(self, response):
        url = response.url
        log.msg("Page parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        f = open("body.txt", "w")
        f.write(response.body)
        f.close()
        return item

    def login_parse(self, response):
        log.msg("Login attempted")
        return [FormRequest.from_response(response,
            formdata={'userlogin[login]': username, 'userlogin[password]': password},
            callback=self.after_login)]

    def after_login(self, response):
        log.msg("Post login")
        if "Login unsuccessful" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            return Request(url="https://www.patientslikeme.com/forum/diabetes2/topics",
                callback=self.page_parse)
And here is my debug log:
2014-03-21 18:22:05+0000 [scrapy] INFO: Scrapy 0.18.2 started (bot: plm)
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Optional features available: ssl, http11
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'plm.spiders', 'ITEM_PIPELINES': {'plm.pipelines.ScreenshotPipeline': 1}, 'DEPTH_LIMIT': 5, 'SPIDER_MODULES': ['plm.spiders'], 'BOT_NAME': 'plm', 'DEPTH_PRIORITY': 1, 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeue.FifoMemoryQueue', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeue.PickleFifoDiskQueue'}
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled item pipelines: ScreenshotPipeline
2014-03-21 18:22:06+0000 [plm] INFO: Spider opened
2014-03-21 18:22:06+0000 [plm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-03-21 18:22:07+0000 [scrapy] INFO: Screenshooter initiated
2014-03-21 18:22:07+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-21 18:22:07+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-21 18:22:08+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/login> (referer: None)
2014-03-21 18:22:08+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/login> (referer: https://www.patientslikeme.com/login)
2014-03-21 18:22:08+0000 [scrapy] INFO: Login attempted
2014-03-21 18:22:08+0000 [plm] DEBUG: Filtered duplicate request: <GET https://www.patientslikeme.com/login> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-03-21 18:22:09+0000 [plm] DEBUG: Redirecting (302) to <GET https://www.patientslikeme.com/profile/activity/all> from <POST https://www.patientslikeme.com/login>
2014-03-21 18:22:10+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/profile/activity/all> (referer: https://www.patientslikeme.com/login)
2014-03-21 18:22:10+0000 [scrapy] INFO: Post login
2014-03-21 18:22:10+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/forum/diabetes2/topics> (referer: https://www.patientslikeme.com/profile/activity/all)
2014-03-21 18:22:10+0000 [scrapy] INFO: Page parse attempted for https://www.patientslikeme.com/forum/diabetes2/topics
2014-03-21 18:22:10+0000 [scrapy] INFO: Screenshot attempted for https://www.patientslikeme.com/forum/diabetes2/topics
2014-03-21 18:22:15+0000 [plm] DEBUG: Scraped from <200 https://www.patientslikeme.com/forum/diabetes2/topics>
{'url': 'https://www.patientslikeme.com/forum/diabetes2/topics'}
2014-03-21 18:22:15+0000 [plm] INFO: Closing spider (finished)
2014-03-21 18:22:15+0000 [plm] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2068,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 53246,
'downloader/response_count': 5,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 3, 21, 18, 22, 15, 177000),
'item_scraped_count': 1,
'log_count/DEBUG': 13,
'log_count/INFO': 8,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2014, 3, 21, 18, 22, 6, 377000)}
2014-03-21 18:22:15+0000 [plm] INFO: Spider closed (finished)
Thanks for any help you can give.
---- EDIT ----
I have tried to implement Paul t.'s suggestion. Unfortunately, I'm getting the following error:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 93, in start
if self.start_crawling():
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 168, in start_crawling
return self.start_crawler() is not None
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 158, in start_crawler
crawler.start()
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1213, in unwindGenerator
return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1070, in _inlineCallbacks
result = g.send(result)
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 74, in start
yield self.schedule(spider, batches)
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 61, in schedule
requests.extend(batch)
exceptions.TypeError: 'Request' object is not iterable
Since it isn't identifying a particular part of the spider that's to blame, I'm struggling to work out where the problem is.
---- EDIT 2 ----
The problem was caused by the start_requests function provided by Paul t., which used return rather than yield. If I change it to yield, it works perfectly.
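For illustration, a sketch of the difference (the error in the traceback above comes from returning a single bare Request, which is not iterable):

# works: start_requests as a generator
def start_requests(self):
    yield Request(self.login_url, callback=self.login_parse)

# also works: returning an iterable of requests
def start_requests(self):
    return [Request(self.login_url, callback=self.login_parse)]

# fails with "'Request' object is not iterable": returning a single bare Request
def start_requests(self):
    return Request(self.login_url, callback=self.login_parse)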
My suggestion is to trick CrawlSpider with:
- a manually crafted request to the login page,
- performing the login,
- and only then doing as if CrawlSpider were starting with start_urls, using CrawlSpider's "magic".
Here's an illustration of that:
class PLMSpider(CrawlSpider):
    name = 'plm'
    allowed_domains = ["patientslikeme.com"]

    # pseudo-start_url
    login_url = "https://www.patientslikeme.com/login"

    # start URLs used after login
    start_urls = [
        "https://www.patientslikeme.com/forum/diabetes2/topics",
    ]

    rules = (
        # you want to do the login only once, so it's probably cleaner
        # not to ask the CrawlSpider to follow links to the login page
        #Rule(SgmlLinkExtractor(allow=(r"patientslikeme.com/login")), callback='login_parse', follow=True),

        # you can also deny "/login" to be safe
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='content-section']"),
                               deny=('/login',)),
             callback='post_parse', follow=False),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='pagination']"),
                               deny=('/login',)),
             callback='page_parse', follow=True),
    )

    def __init__(self, **kwargs):
        ScrapyFileLogObserver(open("debug.log", 'w'), level=logging.DEBUG).start()
        CrawlSpider.__init__(self, **kwargs)

    # by default start_urls pages will be sent to the parse method,
    # but parse() is rather special in CrawlSpider,
    # so I suggest you create your own initial login request "manually"
    # and ask for it to be parsed by your specific callback
    def start_requests(self):
        yield Request(self.login_url, callback=self.login_parse)

    # you've got the login page, send credentials
    # (so far so good...)
    def login_parse(self, response):
        log.msg("Login attempted")
        return [FormRequest.from_response(response,
            formdata={'userlogin[login]': username, 'userlogin[password]': password},
            callback=self.after_login)]

    # so we got a response to the login thing
    # if we're good,
    # just do as if we were starting the crawl now,
    # basically doing what happens when you use start_urls
    def after_login(self, response):
        log.msg("Post login")
        if "Login unsuccessful" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            return [Request(url=u) for u in self.start_urls]
            # alternatively, you could even call CrawlSpider's start_requests() method directly
            # that's probably cleaner
            #return super(PLMSpider, self).start_requests()

    def post_parse(self, response):
        url = response.url
        log.msg("Post parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        return item

    def page_parse(self, response):
        url = response.url
        log.msg("Page parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        f = open("body.txt", "w")
        f.write(response.body)
        f.close()
        return item

    # if you want the start_urls pages to be parsed,
    # you need to tell CrawlSpider to do so by defining the parse_start_url attribute
    # https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py#L38
    parse_start_url = page_parse
Your login page (the start URL) is parsed by the parse_start_url method. You should redefine that method to parse the login page. Have a look at the documentation.
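A minimal sketch of that override, reusing the login callback from the question's spider (assuming start_urls still points at the login page, as in the question):

class PLMSpider(CrawlSpider):
    # ... name, allowed_domains, start_urls and rules as in the question ...

    def parse_start_url(self, response):
        # the start URL (the login page) is handled here instead of CrawlSpider's parse()
        log.msg("Login attempted")
        return FormRequest.from_response(response,
            formdata={'userlogin[login]': username, 'userlogin[password]': password},
            callback=self.after_login)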