Scrapy CrawlSpider isn't following the links on a particular page - python-2.7

I have made a spider to crawl a forum that requires a login. I start it off on the login page. The problem occurs with the page that I direct the spider to after the login succeeds.
If I open up my rules to accept all links, the spider successfully follows the links on the login page. However, it doesn't follow any of the links on the page that I pass to it via Request(), which suggests the problem isn't a mistake in my XPath expressions.
The login appears to work: the page_parse function writes the page source to a text file, and the source is from the page I'm after, which can only be reached after logging in. However, the pipeline I have in place to take a screenshot of each page captures the login page, but not the page that I then send the spider on to.
Here is the spider:
import logging

from scrapy import log
from scrapy.log import ScrapyFileLogObserver
from scrapy.http import Request, FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from plm.items import PLMItem  # assumed location of PLMItem


class PLMSpider(CrawlSpider):
    name = 'plm'
    allowed_domains = ["patientslikeme.com"]
    start_urls = [
        "https://www.patientslikeme.com/login"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"patientslikeme.com/login")), callback='login_parse', follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='content-section']")), callback='post_parse', follow=False),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='pagination']")), callback='page_parse', follow=True),
    )

    def __init__(self, **kwargs):
        ScrapyFileLogObserver(open("debug.log", 'w'), level=logging.DEBUG).start()
        CrawlSpider.__init__(self, **kwargs)

    def post_parse(self, response):
        url = response.url
        log.msg("Post parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        return item

    def page_parse(self, response):
        url = response.url
        log.msg("Page parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        f = open("body.txt", "w")
        f.write(response.body)
        f.close()
        return item

    def login_parse(self, response):
        log.msg("Login attempted")
        return [FormRequest.from_response(response,
                                          formdata={'userlogin[login]': username, 'userlogin[password]': password},
                                          callback=self.after_login)]

    def after_login(self, response):
        log.msg("Post login")
        if "Login unsuccessful" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            return Request(url="https://www.patientslikeme.com/forum/diabetes2/topics",
                           callback=self.page_parse)
And here is my debug log:
2014-03-21 18:22:05+0000 [scrapy] INFO: Scrapy 0.18.2 started (bot: plm)
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Optional features available: ssl, http11
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'plm.spiders', 'ITEM_PIPELINES': {'plm.pipelines.ScreenshotPipeline': 1}, 'DEPTH_LIMIT': 5, 'SPIDER_MODULES': ['plm.spiders'], 'BOT_NAME': 'plm', 'DEPTH_PRIORITY': 1, 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeue.FifoMemoryQueue', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeue.PickleFifoDiskQueue'}
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled item pipelines: ScreenshotPipeline
2014-03-21 18:22:06+0000 [plm] INFO: Spider opened
2014-03-21 18:22:06+0000 [plm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-03-21 18:22:07+0000 [scrapy] INFO: Screenshooter initiated
2014-03-21 18:22:07+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-21 18:22:07+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-21 18:22:08+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/login> (referer: None)
2014-03-21 18:22:08+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/login> (referer: https://www.patientslikeme.com/login)
2014-03-21 18:22:08+0000 [scrapy] INFO: Login attempted
2014-03-21 18:22:08+0000 [plm] DEBUG: Filtered duplicate request: <GET https://www.patientslikeme.com/login> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-03-21 18:22:09+0000 [plm] DEBUG: Redirecting (302) to <GET https://www.patientslikeme.com/profile/activity/all> from <POST https://www.patientslikeme.com/login>
2014-03-21 18:22:10+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/profile/activity/all> (referer: https://www.patientslikeme.com/login)
2014-03-21 18:22:10+0000 [scrapy] INFO: Post login
2014-03-21 18:22:10+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/forum/diabetes2/topics> (referer: https://www.patientslikeme.com/profile/activity/all)
2014-03-21 18:22:10+0000 [scrapy] INFO: Page parse attempted for https://www.patientslikeme.com/forum/diabetes2/topics
2014-03-21 18:22:10+0000 [scrapy] INFO: Screenshot attempted for https://www.patientslikeme.com/forum/diabetes2/topics
2014-03-21 18:22:15+0000 [plm] DEBUG: Scraped from <200 https://www.patientslikeme.com/forum/diabetes2/topics>
{'url': 'https://www.patientslikeme.com/forum/diabetes2/topics'}
2014-03-21 18:22:15+0000 [plm] INFO: Closing spider (finished)
2014-03-21 18:22:15+0000 [plm] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2068,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 53246,
'downloader/response_count': 5,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 3, 21, 18, 22, 15, 177000),
'item_scraped_count': 1,
'log_count/DEBUG': 13,
'log_count/INFO': 8,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2014, 3, 21, 18, 22, 6, 377000)}
2014-03-21 18:22:15+0000 [plm] INFO: Spider closed (finished)
Thanks for any help you can give.
---- EDIT ----
I have tried to implement Paul t.'s suggestion. Unfortunately, I'm getting the following error:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 93, in start
if self.start_crawling():
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 168, in start_crawling
return self.start_crawler() is not None
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 158, in start_crawler
crawler.start()
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1213, in unwindGenerator
return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1070, in _inlineCallbacks
result = g.send(result)
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 74, in start
yield self.schedule(spider, batches)
File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 61, in schedule
requests.extend(batch)
exceptions.TypeError: 'Request' object is not iterable
Since the traceback doesn't point to a particular part of the spider, I'm struggling to work out where the problem is.
---- EDIT 2 ----
The problem was being caused by the start_requests function provided by Paul t., which used return rather than yield. If I change it to yield, it works perfectly.
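For reference, a minimal sketch of the difference (start_requests must return an iterable of requests; a bare return of a single Request is what trips over the requests.extend(batch) call in the traceback; names are as in the answer below):
# broken: returns a single Request object, which is not iterable
#def start_requests(self):
#    return Request(self.login_url, callback=self.login_parse)

# working: yield makes start_requests a generator, i.e. an iterable of Requests
def start_requests(self):
    yield Request(self.login_url, callback=self.login_parse)
    # returning a list, e.g. return [Request(...)], would also work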

My suggestion is to trick CrawlSpider with:
a manually crafted request to the login page,
performing the login,
and only then proceeding as if CrawlSpider were starting from start_urls, using CrawlSpider's "magic"
Here's an illustration of that:
class PLMSpider(CrawlSpider):
    name = 'plm'
    allowed_domains = ["patientslikeme.com"]

    # pseudo-start_url
    login_url = "https://www.patientslikeme.com/login"

    # start URLs used after login
    start_urls = [
        "https://www.patientslikeme.com/forum/diabetes2/topics",
    ]

    rules = (
        # you want to do the login only once, so it's probably cleaner
        # not to ask the CrawlSpider to follow links to the login page
        #Rule(SgmlLinkExtractor(allow=(r"patientslikeme.com/login")), callback='login_parse', follow=True),

        # you can also deny "/login" to be safe
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='content-section']"),
                               deny=('/login',)),
             callback='post_parse', follow=False),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='pagination']"),
                               deny=('/login',)),
             callback='page_parse', follow=True),
    )

    def __init__(self, **kwargs):
        ScrapyFileLogObserver(open("debug.log", 'w'), level=logging.DEBUG).start()
        CrawlSpider.__init__(self, **kwargs)

    # by default start_urls pages will be sent to the parse method,
    # but parse() is rather special in CrawlSpider
    # so I suggest you create your own initial login request "manually"
    # and ask for it to be parsed by your specific callback
    def start_requests(self):
        yield Request(self.login_url, callback=self.login_parse)

    # you've got the login page, send credentials
    # (so far so good...)
    def login_parse(self, response):
        log.msg("Login attempted")
        return [FormRequest.from_response(response,
                                          formdata={'userlogin[login]': username, 'userlogin[password]': password},
                                          callback=self.after_login)]

    # so we got a response to the login thing
    # if we're good,
    # just do as if we were starting the crawl now,
    # basically doing what happens when you use start_urls
    def after_login(self, response):
        log.msg("Post login")
        if "Login unsuccessful" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            return [Request(url=u) for u in self.start_urls]
            # alternatively, you could even call CrawlSpider's start_requests() method directly
            # that's probably cleaner
            #return super(PLMSpider, self).start_requests()

    def post_parse(self, response):
        url = response.url
        log.msg("Post parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        return item

    def page_parse(self, response):
        url = response.url
        log.msg("Page parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        f = open("body.txt", "w")
        f.write(response.body)
        f.close()
        return item

    # if you want the start_urls pages to be parsed,
    # you need to tell CrawlSpider to do so by defining the parse_start_url attribute
    # https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py#L38
    parse_start_url = page_parse

Your login page is parsed by the parse_start_url method.
You should override that method to parse the login page.
Have a look at the documentation.
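A minimal sketch of that, assuming the login page is kept as the spider's start URL as in the question:
class PLMSpider(CrawlSpider):
    # name, start_urls and rules as in the question

    def parse_start_url(self, response):
        # CrawlSpider routes responses for start_urls here,
        # so hand the login page to the login callback
        return self.login_parse(response)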

Related

Scrapy: Record the original URL requested using text file in case of a URL redirect

I am scraping from a list I have in a text file. The website I am scraping has many cases where the URL from the text file redirects to another URL. I would like to be able to record both the original URL in the text file and the redirected URL.
My spider code is as follows:
import datetime
import urlparse
import socket

import scrapy
from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader

from ..items import TermsItem


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]

    # Start on a property page
    start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]

    def parse(self, response):
        # Create the loader using the response
        l = ItemLoader(item=TermsItem(), response=response)

        # Load fields using XPath expressions
        l.add_xpath('title', '//h1[@class="foo"]/span/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('detail', '//*[@class="bar"]//text()',
                    MapCompose(unicode.strip))

        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())

        return l.load_item()
My Items.py is as follows:
from scrapy.item import Item, Field


class TermsItem(Item):
    # Primary fields
    title = Field()
    detail = Field()

    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
Do I need to make a 'callback' that somehow associates with the i.strip() from
start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]
and then add a field in items.py to load into the #Housekeeping fields?
I originally tested replacing:
l.add_value('url', response.url)
with
l.add_value('url', response.request.url)
but this produced the same result.
Any help would be much appreciated.
Regards,
You need to use the handle_httpstatus_list attribute in your spider. Consider the example below:
from scrapy import Spider


class First(Spider):
    name = "redirect"
    # redirect statuses listed here are passed to the callback
    # instead of being followed automatically
    handle_httpstatus_list = [301, 302, 304, 307]
    start_urls = ["http://www.google.com"]

    def parse(self, response):
        if 300 < response.status < 400:
            redirect_to = response.headers['Location'].decode("utf-8")
            print(response.url + " is being redirected to " + redirect_to)
            # if we need to process this new location we need to yield it ourself
            yield response.follow(redirect_to)
        else:
            print(response.url)
The output of the same is
2017-09-21 11:00:08 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.google.com> (referer: None)
http://www.google.com is being redirected to http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw
2017-09-21 11:00:08 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw> (referer: None)
http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw is being redirected to https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw&gws_rd=ssl
2017-09-21 11:00:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw&gws_rd=ssl> (referer: http://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw)
https://www.google.co.in/?gfe_rd=cr&dcr=0&ei=XU7DWYDrNquA8QeT0ZW4Cw&gws_rd=ssl
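To tie this back to the question, one way to record both the URL from the text file and the final URL is to carry the original through request.meta when following the redirect yourself; a sketch combining the answer above with the question's ItemLoader (it assumes an extra original_url Field on TermsItem):
from scrapy import Spider
from scrapy.loader import ItemLoader

from ..items import TermsItem  # assumes TermsItem gains an extra original_url Field


class BasicSpider(Spider):
    name = "basic"
    # let redirect responses reach parse() so we can record where they came from
    handle_httpstatus_list = [301, 302, 304, 307]
    start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]

    def parse(self, response):
        if 300 < response.status < 400:
            redirect_to = response.headers['Location'].decode("utf-8")
            # remember the URL from the text file before following the redirect ourselves
            yield response.follow(redirect_to, meta={'original_url': response.url})
        else:
            l = ItemLoader(item=TermsItem(), response=response)
            # fall back to response.url when there was no redirect
            l.add_value('original_url', response.meta.get('original_url', response.url))
            l.add_value('url', response.url)
            # ... other add_xpath/add_value calls as in the question ...
            yield l.load_item()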

Scrapy Request with callback not being executed

My spider has been running fine so far. Everything works except this bit:
# -*- coding: utf-8 -*-
import scrapy
from info.items import InfoItem


class HeiseSpider(scrapy.Spider):
    name = "heise"
    start_urls = ['https://www.heise.de/']

    def parse(self, response):
        print("Parse")
        yield scrapy.Request(response.url, callback=self.getSubList)

    def getSubList(self, response):
        item = InfoItem()
        print("Sub List: Will it work?")
        yield scrapy.Request('https://www.test.de/', callback=self.getScore, dont_filter=True)
        print("Should have")
        yield item

    def getScore(self, response):
        print("--------- Get Score ----------")
        print(response)
        return True
The output is:
Will it work?
Should have
Why is getScore not being called?
What am I doing wrong?
Edit: Changed code to a barebone version with the same issue - getScore not being called
Just did a test run and it went through all callbacks as expected:
...
2017-05-13 12:27:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.heise.de/> (referer: None)
Parse
2017-05-13 12:27:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.heise.de/> (referer: https://www.heise.de/)
Sub List: Will it work?
Should have
2017-05-13 12:27:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.test.de/> (referer: https://www.heise.de/)
--------- Get Score ----------
<200 https://www.test.de/>
2017-05-13 12:27:59 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'bool' in <GET https://www.test.de/>
2017-05-13 12:27:59 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-13 12:27:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 693,
...
Without any logging output, and with your settings.py missing, this is a bit of a guess, but it's quite likely that your settings.py contains ROBOTSTXT_OBEY = True.
This means Scrapy will respect any limitations imposed by robots.txt files, and https://www.test.de has a robots.txt that disallows crawling.
So change the ROBOTSTXT_OBEY line in settings.py to ROBOTSTXT_OBEY = False and it should work.
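If you'd rather keep ROBOTSTXT_OBEY = True for the rest of the project, a per-spider override via the standard custom_settings attribute is a possible alternative (a sketch):
import scrapy


class HeiseSpider(scrapy.Spider):
    name = "heise"
    start_urls = ['https://www.heise.de/']
    # override the project-wide setting for this spider only
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }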

Scrapy encountered http status <521>

I am new to Scrapy and tried to crawl a website page, but the request came back with HTTP status code 521.
Does that mean the server refuses the connection? (I can open the page in a browser.)
I tried setting a cookie, but it still returned 521.
Questions:
What is the reason I'm getting a 521 status code?
Is it because of the cookie setting? Am I doing something wrong in my cookie-setting code?
How can I crawl this page?
Thank you very much for your help!
The log:
2015-06-07 08:27:26+0800 [scrapy] INFO: Scrapy 0.24.6 started (bot: ccdi)
2015-06-07 08:27:26+0800 [scrapy] INFO: Optional features available: ssl, http11
2015-06-07 08:27:26+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ccdi.spiders', 'FEED_URI': '412.json', 'SPIDER_MODULES': ['ccdi.spiders'], 'BOT_NAME': 'ccdi', 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3)AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 2}
2015-06-07 08:27:26+0800 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-07 08:27:27+0800 [scrapy] INFO: Enabled item pipelines:
2015-06-07 08:27:27+0800 [ccdi] INFO: Spider opened
2015-06-07 08:27:27+0800 [ccdi] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-07 08:27:27+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-07 08:27:27+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-07 08:27:27+0800 [ccdi] DEBUG: Crawled (521) <GET http://www.ccdi.gov.cn/jlsc/index_2.html> (referer: None)
2015-06-07 08:27:27+0800 [ccdi] DEBUG: Ignoring response <521 http://www.ccdi.gov.cn/jlsc/index_2.html>: HTTP status code is not handled or not allowed
2015-06-07 08:27:27+0800 [ccdi] INFO: Closing spider (finished)
2015-06-07 08:27:27+0800 [ccdi] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 537,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 512,
'downloader/response_count': 1,
'downloader/response_status_count/521': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 7, 0, 27, 27, 468000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 6, 7, 0, 27, 27, 359000)}
2015-06-07 08:27:27+0800 [ccdi] INFO: Spider closed (finished)
My original code:
#encoding: utf-8
import sys
import scrapy
import re
from scrapy.selector import Selector
from scrapy.http.request import Request
from ccdi.items import CcdiItem
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class CcdiSpider(CrawlSpider):
    name = "ccdi"
    allowed_domains = ["ccdi.gov.cn"]
    start_urls = "http://www.ccdi.gov.cn/jlsc/index_2.html"

    #rules = (
    #    Rule(SgmlLinkExtractor(allow=r"/jlsc/+", ),
    #         callback="parse_ccdi", follow=True),
    #
    #)

    def start_requests(self):
        yield Request(self.start_urls, cookies={'NAME': 'Value'}, callback=self.parse_ccdi)

    def parse_ccdi(self, response):
        item = CcdiItem()
        self.get_title(response, item)
        self.get_url(response, item)
        self.get_time(response, item)
        self.get_keyword(response, item)
        self.get_text(response, item)
        return item

    def get_title(self, response, item):
        title = response.xpath("/html/head/title/text()").extract()
        if title:
            item['ccdi_title'] = title

    def get_text(self, response, item):
        ccdi_body = response.xpath("//div[@class='TRS_Editor']/div[@class='TRS_Editor']/p/text()").extract()
        if ccdi_body:
            item['ccdi_body'] = ccdi_body

    def get_time(self, response, item):
        ccdi_time = response.xpath("//em[@class='e e2']/text()").extract()
        if ccdi_time:
            item['ccdi_time'] = ccdi_time[0][5:]

    def get_url(self, response, item):
        ccdi_url = response.url
        if ccdi_url:
            print ccdi_url
            item['ccdi_url'] = ccdi_url

    def get_keyword(self, response, item):
        ccdi_keyword = response.xpath("//html/head/meta[@http-equiv = 'keywords']/@content").extract()
        if ccdi_keyword:
            item['ccdi_keyword'] = ccdi_keyword
The HTTP status code 521 is a custom error code sent by Cloudflare and usually means that the web server is down: https://support.cloudflare.com/hc/en-us/articles/115003011431-Troubleshooting-Cloudflare-5XX-errors#521error
In my case the error did not occur anymore after setting a custom USER_AGENT in my settings.py.
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'crawler (+http://example.com)'
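Separately, if you want to inspect the body of the 521 response instead of having it dropped (the "HTTP status code is not handled or not allowed" line in the log), you can let it through to your callback with handle_httpstatus_list; a debugging sketch:
class CcdiSpider(CrawlSpider):
    name = "ccdi"
    allowed_domains = ["ccdi.gov.cn"]
    # pass 521 responses to the callback instead of ignoring them,
    # so the response body can be inspected for clues
    handle_httpstatus_list = [521]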

Scrapy is not entering the parse function

I am running the spider below, but it is not entering the parse method and I don't know why. Can someone please help?
My code is below:
from scrapy.item import Item, Field
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class MyItem(Item):
    reviewer_ranking = Field()
    print "asdadsa"


class MySpider(BaseSpider):
    name = 'myspider'
    allowed_domains = ["amazon.com"]
    start_urls = ["http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp"]
    print "sadasds"

    def parse(self, response):
        print "fggfggftgtr"
        sel = Selector(response)
        hxs = HtmlXPathSelector(response)
        item = MyItem()
        item["reviewer_ranking"] = hxs.select('//span[@class="a-size-small a-color-secondary"]/text()').extract()
        return item
The output which I am getting is as below
$ scrapy runspider crawler_reviewers_data.py
asdadsa
sadasds
/home/raj/Documents/IIM A/Daily sales rank/Daily reviews/Reviews_scripts/Scripts_review/Reviews/Reviewer/crawler_reviewers_data.py:18: ScrapyDeprecationWarning: crawler_reviewers_data.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
class MySpider(BaseSpider):
2014-06-24 19:21:35+0530 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
2014-06-24 19:21:35+0530 [scrapy] INFO: Optional features available: ssl, http11
2014-06-24 19:21:35+0530 [scrapy] INFO: Overridden settings: {}
2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled item pipelines:
2014-06-24 19:21:35+0530 [myspider] INFO: Spider opened
2014-06-24 19:21:35+0530 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-24 19:21:35+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6027
2014-06-24 19:21:35+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6084
2014-06-24 19:21:36+0530 [myspider] DEBUG: Crawled (403) <GET http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp> (referer: None) ['partial']
2014-06-24 19:21:36+0530 [myspider] INFO: Closing spider (finished)
2014-06-24 19:21:36+0530 [myspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 259,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 28487,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 6, 24, 13, 51, 36, 631236),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 6, 24, 13, 51, 35, 472849)}
2014-06-24 19:21:36+0530 [myspider] INFO: Spider closed (finished)
Please help me; I am stuck at this very point.
This is an anti-web-crawling technique used by Amazon: you are getting 403 Forbidden because it requires a User-Agent header to be sent with the request.
One option would be to manually add it to the Request yielded from start_requests():
from scrapy.http import Request


class MySpider(BaseSpider):
    name = 'myspider'
    allowed_domains = ["amazon.com"]

    def start_requests(self):
        yield Request("https://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp",
                      headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"})

    ...
Another option would be to set the DEFAULT_REQUEST_HEADERS setting project-wide.
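If the goal is only the User-Agent header, the USER_AGENT setting is the more common project-wide knob for it (the same approach the 521 answer earlier uses); a minimal sketch reusing the header value from the example above:
# settings.py -- picked up by UserAgentMiddleware for every request
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"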
Also note that Amazon provides an API which has a python wrapper, consider using it.
Hope that helps.

Scrapy – cannot store scraped values in a file

I am trying to crawl the web in order to find blogs with Polish or Poland in their titles. I have some problems at the very beginning: my spider is able to scrape my website's title, but doesn't store it in a file when running
scrapy crawl spider -o test.csv -t csv blogseek
Here are my settings:
spider
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from polishblog.items import PolishblogItem


class BlogseekSpider(CrawlSpider):
    name = 'blogseek'
    start_urls = [
        #'http://www.thepolskiblog.co.uk',
        #'http://blogs.transparent.com/polish',
        #'http://poland.leonkonieczny.com/blog/',
        #'http://www.imaginepoland.blogspot.com'
        'http://www.normalesup.org/~dthiriet'
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = PolishblogItem()
        i['titre'] = sel.xpath('//title/text()').extract()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i
items.py
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class PolishblogItem(Item):
    # define the fields for your item here like:
    titre = Field()
    #description = Field()
    #url = Field()
    #pass
When I run
scrapy parse --spider=blogseek -c parse_item -d 2 'http://www.normalesup.org/~dthiriet'
I get the title scraped as expected. So what am I missing? I'd bet it's something silly, but I couldn't find the issue. Thanks!
EDIT: maybe there is an issue with the feed export settings. When I run with this settings.py:
# Scrapy settings for polishblog project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'polishblog'
SPIDER_MODULES = ['polishblog.spiders']
NEWSPIDER_MODULE = 'polishblog.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'damien thiriet (+http://www.normalesup.org/~dthiriet)'
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_DELAY=0.25
ROBOTSTXT_OBEY=True
DEPTH_LIMIT=3
# results storage
FEED_EXPORTERS='CsvItemExporter'
FEED_URI='titresblogs.csv'
FEED_FORMAT='csv'
I get an error message:
File "/usr/lib/python2.7/site-packages/scrapy/contrib/feedexport.py", line 196, in _load_components
    conf.update(self.settings[setting_prefix])
ValueError: dictionary update sequence element #0 has length 1; 2 is required
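For reference, FEED_EXPORTERS is expected to be a dict mapping a feed format to an exporter class path, which is what the conf.update(...) line in the traceback is trying to do with the string above; a sketch of the dict form (the 'csv' entry is already part of Scrapy's defaults, so it is shown only for illustration):
# settings.py -- FEED_EXPORTERS maps a format name to an exporter class path
FEED_EXPORTERS = {
    'csv': 'scrapy.contrib.exporter.CsvItemExporter',
}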
I installed Scrapy this way:
pip2.7 install Scrapy
Was I wrong? The docs recommend pip install Scrapy, but then I would have Python 3.4 dependencies installed; I bet this is not the point.
EDIT #2:
Here are my logs
2014-06-10 11:00:15+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: polishblog)
2014-06-10 11:00:15+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-06-10 11:00:15+0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'polishblog.spiders', 'FEED_URI': 'stdout:', 'DEPTH_LIMIT': 3, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['polishblog.spiders'], 'BOT_NAME': 'polishblog', 'ROBOTSTXT_OBEY': True, 'COOKIES_ENABLED': False, 'USER_AGENT': 'damien thiriet (+http://www.normalesup.org/~dthiriet)', 'LOG_FILE': '/tmp/scrapylog', 'DOWNLOAD_DELAY': 0.25}
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-10 11:00:15+0200 [scrapy] INFO: Enabled item pipelines:
2014-06-10 11:00:15+0200 [blogseek] INFO: Spider opened
2014-06-10 11:00:15+0200 [blogseek] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-10 11:00:15+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-06-10 11:00:15+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-06-10 11:00:15+0200 [blogseek] DEBUG: Crawled (200) <GET http://www.normalesup.org/robots.txt> (referer: None)
2014-06-10 11:00:15+0200 [blogseek] DEBUG: Redirecting (301) to <GET http://www.normalesup.org/~dthiriet/> from <GET http://www.normalesup.org/~dthiriet>
2014-06-10 11:00:16+0200 [blogseek] DEBUG: Crawled (200) <GET http://www.normalesup.org/~dthiriet/> (referer: None)
2014-06-10 11:00:16+0200 [blogseek] INFO: Closing spider (finished)
2014-06-10 11:00:16+0200 [blogseek] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 737,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 6187,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 6, 10, 9, 0, 16, 166865),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 6, 10, 9, 0, 15, 334634)}
2014-06-10 11:00:16+0200 [blogseek] INFO: Spider closed (finished)