My scrapy spider is not working - python-2.7

I tried to make a spider that crawls tripadvisor.in to extract some data, but I don't know why it's not working. My project name is spidey. Here is the spider I made:
import scrapy
from scrapy.selector import Selector
from spidey.items import tripad
class DmozSpider(scrapy.Spider):
name="spidey"
allowed_domains=["https://www.tripadvisor.in"]
start_urls=['https://www.tripadvisor.in/Attractions-g297604-Activities-Goa.html']
def parse(self, response):
sel=Selector(response)
sites=sel.xpath('//div[@id="FILTERED_LIST"]/div[@class="tmHide"]/div[@class="element_wrap"]/div[@class="wrap al_border attraction_element"]/div[@class="entry al_offer_group"]/div[@class="property_title"]').extract()
items=[]
for site in sites:
item=tripad()
item['name']=site.xpath('//h1[@id="HEADING" class="heading_name"]/text()').extract()
items.append(item)
return items

I'll point out two errors; there may be more.
As @Rafael said, allowed_domains is wrong: it should list domain names, not URLs.
Indentation is significant in Python, and yours is wrong.
import scrapy
from scrapy.selector import Selector
from spidey.items import tripad
class DmozSpider(scrapy.Spider):
    name = "spidey"
    # allowed_domains takes bare domain names, not full URLs.
    allowed_domains = ["tripadvisor.in"]
    start_urls = ['https://www.tripadvisor.in/Attractions-g297604-Activities-Goa.html']
    def parse(self, response):
        sel = Selector(response)
        # Keep selectors here (no .extract()) so that site.xpath() works in the loop.
        sites = sel.xpath('//div[@id="FILTERED_LIST"]/div[@class="tmHide"]/div[@class="element_wrap"]/div[@class="wrap al_border attraction_element"]/div[@class="entry al_offer_group"]/div[@class="property_title"]')
        # I prefer to yield items:
        for site in sites:
            item = tripad()
            # XPath needs "and" between two attribute tests in one predicate.
            item['name'] = site.xpath('//h1[@id="HEADING" and @class="heading_name"]/text()').extract()
            yield item
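The allowed_domains fix can be illustrated with a small sketch. It assumes Python 3 and a hypothetical helper, domain_for_allowed_list, which is not part of Scrapy; it only uses the standard library to strip the scheme (and a leading "www.") from a URL, leaving the kind of bare domain name that allowed_domains expects.

```python
from urllib.parse import urlparse


def domain_for_allowed_list(url):
    """Hypothetical helper: reduce a full URL to a bare domain name."""
    host = urlparse(url).hostname or ""
    # Dropping "www." leaves the registered domain used in the answer above.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


start_url = "https://www.tripadvisor.in/Attractions-g297604-Activities-Goa.html"
allowed_domains = [domain_for_allowed_list(start_url)]
print(allowed_domains)
```

Writing the domain by hand, as in the answer, is just as good; the point is only that the entry carries no scheme and no path.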

Related

Scrapy scrapes from only some pages -- Crawled(200) (referer: None) Error?

I have written a Scrapy project to scrape some data from the Congress.gov website. Originally, I was hoping to scrape the data on all bills. My code ran and downloaded the data I wanted, but only for about half the bills. So I began troubleshooting. I turned on autothrottle in the settings and included middleware code for "too many requests" responses. I then limited the search criteria to a particular congress (the 97th) and to bills originating in the Senate, and re-ran the code. It downloaded most of the bills, but again some were missing. I then tried to scrape just the pages that were missing. In particular, when I scraped page 32 on its own, it worked. So why won't it scrape all the pages when I use the recursive code?
Can anyone help me to figure out what the problem is? Here is the code I used to scrape the info from all bills in the 97th congress:
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from senatescraper.items import senatescraperSampleItem
from scrapy.http.request import Request
class SenatebotSpider(BaseSpider):
    name = 'recursivesenatetablebot2'
    allowed_domains = ['www.congress.gov']
    def start_requests(self):
        baseurl = "https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22chamber%22%3A%22Senate%22%2C%22congress%22%3A%5B%2297%22%5D%2C%22type%22%3A%5B%22bills%22%5D%7D&page="
        for i in xrange(1,32):
            beginurl= baseurl + `i`
            yield Request(beginurl, self.parse_bills)
    def parse_bills(self, response):
        sel= Selector(response)
        bills=sel.xpath("//span[5][@class='result-item']")
        for bill in bills:
            bill_url=bill.css("span.result-item a::attr(href)").extract()[0]
            yield Request(url=bill_url, callback=self.parse_items)
    def parse_items(self, response):
        sel=Selector(response)
        rows=sel.css('table.item_table tbody tr')
        items=[]
        for row in rows:
            item = senatescraperSampleItem()
            item['bill']=response.css('h1.legDetail::text').extract()
            item['dates']=row.xpath('./td[1]/text()').extract()[0]
            item['actions']=row.css('td.actions::text').extract()
            item['congress']=response.css('h2.primary::text').extract()
            items.append(item)
        return items
This is the code I used to just scrape page 32 of search with filters for the 97th congress, bills originating in the senate only:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from startingover.items import startingoverSampleItem
class DebuggingSpider(BaseSpider):
    name = 'debugging'
    allowed_domains = ['www.congress.gov']
    def start_requests(self):
        yield scrapy.Request('https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22chamber%22%3A%22Senate%22%2C%22congress%22%3A%5B%2297%22%5D%2C%22type%22%3A%5B%22bills%22%5D%7D&page=32', self.parse_page)
    def parse_page(self, response):
        sel= Selector(response)
        bills=sel.xpath("//span[5][@class='result-item']")
        for bill in bills:
            bill_url=bill.css("span.result-item a::attr(href)").extract()[0]
            yield Request(url=bill_url, callback=self.parse_items)
    def parse_items(self, response):
        sel=Selector(response)
        rows=sel.css('table.item_table tbody tr')
        items=[]
        for row in rows:
            item = startingoverSampleItem()
            item['bill']=response.css('h1.legDetail::text').extract()
            item['dates']=row.xpath('./td[1]/text()').extract()[0]
            item['actions']=row.css('td.actions::text').extract()
            item['congress']=response.css('h2.primary::text').extract()
            items.append(item)
        return items
And my item:
from scrapy.item import Item, Field
class senatescraperSampleItem(Item):
    bill=Field()
    actions=Field(serializer=str)
    congress=Field(serializer=str)
    dates=Field()
I think you are missing about half of the things you are trying to scrape because you are not resolving relative URLs. Using response.urljoin should remedy the situation.
yield Request(url=response.urljoin(bill_url), callback=self.parse_items)
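To see what resolving a relative URL means here, the sketch below uses the standard library's urljoin (Python 3), which performs the same kind of resolution against the page URL. The query string and the link paths are simplified, hypothetical values, not taken from the real site.

```python
from urllib.parse import urljoin

# Simplified, hypothetical search-page URL and link targets.
page_url = "https://www.congress.gov/search?q=legislation&page=5"
relative_href = "/bill/97th-congress/senate-bill/1/all-actions"

# A relative href is resolved against the page it was found on.
absolute = urljoin(page_url, relative_href)
print(absolute)

# An href that is already absolute passes through unchanged.
already_absolute = urljoin(page_url, "https://www.congress.gov/bill/97th-congress/senate-bill/2")
print(already_absolute)
```

Because absolute links are left alone, it is safe to apply the join to every extracted href.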
You may experience this exception:
2018-01-30 17:27:13 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22chamber%22%3A%22Senate%22%2C%22congress%22%3A%5B%2297%22%5D%2C%22type%22%3A%5B%22bills%22%5D%7D&page=5> (referer: None)
Traceback (most recent call last):
  File "/home/jorge/.virtualenvs/scrapy/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/jorge/.virtualenvs/scrapy/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/home/jorge/.virtualenvs/scrapy/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/jorge/.virtualenvs/scrapy/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/jorge/.virtualenvs/scrapy/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/tmp/stackoverflow/senatescraper/senatescraper/spiders/senatespider.py", line 25, in parse_bills
    bill_url = bill.css("span.result-item a::attr(href)").extract()[0]
IndexError: list index out of range
To make sure you get the URL from the element with the text "All actions", and don't catch anything unexpected that may exist before that element, build your XPath query as follows:
def parse_bills(self, response):
    sel = Selector(response)
    bills = sel.xpath(
        '//a[contains(@href, "all-actions")]/@href').extract()
    for bill in bills:
        yield Request(
            url=response.urljoin(bill),
            callback=self.parse_items,
            dont_filter=True)
Note the dont_filter=True argument. I added it because Scrapy was filtering out links I had already crawled (this is the default configuration). You can remove it if you manage the filtering of duplicate links in a different manner.
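The effect of dont_filter can be pictured with a toy model. The ToyScheduler class below is a hypothetical stand-in written for this illustration, not Scrapy's real duplicate filter, and the URL is made up; it only shows the idea that a URL seen before is dropped unless the request opts out.

```python
class ToyScheduler:
    """Toy stand-in for a duplicate filter; not Scrapy's implementation."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def enqueue(self, url, dont_filter=False):
        # A URL seen before is dropped unless the caller opts out of filtering.
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
url = "https://www.congress.gov/bill/1/all-actions"  # hypothetical URL
first = scheduler.enqueue(url)
second = scheduler.enqueue(url)
forced = scheduler.enqueue(url, dont_filter=True)
print(first, second, forced)
```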
When you get exceptions, you can always wrap the failing code in try and except and start Scrapy's debugging shell in the except block; it will help you inspect the response and see what's going on.
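The IndexError in the traceback comes from indexing an empty list with [0]. A minimal sketch of a guard, using plain lists so it runs without Scrapy; first_or_none and the sample data are hypothetical. In a real callback, the skipped branch is where one could call inspect_response(response, self) from scrapy.shell to open the debugging shell.

```python
def first_or_none(values):
    """Return the first element of a list, or None when the list is empty."""
    return values[0] if values else None


# Hypothetical extraction results: one result row has a link, the other has none.
extracted_per_row = [
    ["/bill/97th-congress/senate-bill/1/all-actions"],
    [],
]

bill_urls = []
for hrefs in extracted_per_row:
    href = first_or_none(hrefs)
    if href is None:
        # Nothing was extracted for this row; skip it instead of raising IndexError.
        continue
    bill_urls.append(href)

print(bill_urls)
```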
I made the following change to my code and it worked perfectly:
def parse_bills(self, response):
    bills=Selector(response)
    billlinks=bills.xpath('//a[contains(@href,"/all-actions")]/@href')
    for link in billlinks:
        urllink=link.extract()
        yield Request(url=urllink, callback=self.parse_items)

Spider won't run after updating Scrapy

As seems to happen frequently here, I am quite new to Python 2.7 and Scrapy. Our project has us scraping website data, following some links, scraping some more, and so on. This was all working fine. Then I updated Scrapy.
Now when I launch my spider, I get the following message:
This wasn't coming up anywhere previously (none of my prior error messages looked anything like this). I am now running scrapy 1.1.0 on Python 2.7. And none of the spiders that had previously worked on this project are working.
I can provide some example code if need be, but my (admittedly limited) knowledge of Python suggests to me that it's not even getting to my script before bombing out.
EDIT:
OK, so this code is supposed to start at the first authors page for Deakin University academics on The Conversation, and go through and scrape how many articles they have written and comments they have made.
import scrapy
from ltuconver.items import ConversationItem
from ltuconver.items import WebsitesItem
from ltuconver.items import PersonItem
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
import bs4
class ConversationSpider(scrapy.Spider):
    name = "urls"
    allowed_domains = ["theconversation.com"]
    start_urls = [
        'http://theconversation.com/institutions/deakin-university/authors']
    #URL grabber
    def parse(self, response):
        requests = []
        people = Selector(response).xpath('///*[@id="experts"]/ul[*]/li[*]')
        for person in people:
            item = WebsitesItem()
            item['url'] = 'http://theconversation.com/'+str(person.xpath('a/@href').extract())[4:-2]
            self.logger.info('parseURL = %s',item['url'])
            requests.append(Request(url=item['url'], callback=self.parseMainPage))
        soup = bs4.BeautifulSoup(response.body, 'html.parser')
        try:
            nexturl = 'https://theconversation.com'+soup.find('span',class_='next').find('a')['href']
            requests.append(Request(url=nexturl))
        except:
            pass
        return requests
    #go to URLs are grab the info
    def parseMainPage(self, response):
        person = Selector(response)
        item = PersonItem()
        item['name'] = str(person.xpath('//*[@id="outer"]/header/div/div[2]/h1/text()').extract())[3:-2]
        item['occupation'] = str(person.xpath('//*[@id="outer"]/div/div[1]/div[1]/text()').extract())[11:-15]
        item['art_count'] = int(str(person.xpath('//*[@id="outer"]/header/div/div[3]/a[1]/h2/text()').extract())[3:-3])
        item['com_count'] = int(str(person.xpath('//*[@id="outer"]/header/div/div[3]/a[2]/h2/text()').extract())[3:-3])
And in my Settings, I have:
BOT_NAME = 'ltuconver'
SPIDER_MODULES = ['ltuconver.spiders']
NEWSPIDER_MODULE = 'ltuconver.spiders'
DEPTH_LIMIT=1
Apparently my six.py file was corrupt (or something like that). After swapping it out with the same file from a colleague, it started working again 8-\

can't crawl all pages in a website

I was trying to crawl the data on all the pages, but I can't join the URL. What mistake am I making?
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
import urlparse
from data.items import TextPostItem
from scrapy import optional_features
optional_features.remove('boto')
class RedditCrawler(CrawlSpider):
    name = 'reddit_crawler'
    allowed_domains = ['yellowpages.com']
    start_urls = ['http://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=California%2C%20KY']
    custom_settings = {
        'BOT_NAME': 'reddit-scraper',
        'DEPTH_LIMIT': 7,
        'DOWNLOAD_DELAY': 3
    }
    def parse(self, response):
        s = Selector(response)
        next_link = s.xpath('//a[@class="next ajax-page"]/@href').extract()[0]
        full_link = urlparse.urljoin('http://www.yellowpages.com',next_link)
        yield self.make_requests_from_url(full_link)
        posts = Selector(response).xpath('//div[@class="search-results organic"]')
        for post in posts:
            item = TextPostItem()
            item['address']= post.xpath("//p[@class='adr']//text()").extract()
            item['business_name']= post.xpath("//a[@class='business-name']//text()").extract()
            item['phonenumber']= post.xpath("//div[@class='phones phone primary']//text()").extract()
            item['categories']=post.xpath("//div[@class='categories']//text()").extract()
            item['next_link']=post.xpath("//div[@class='pagination']//a[@class='next ajax-page']//@href").extract()
            yield item
I think your XPath '//div[@class="next ajax-page"]//ul//li[6]//a/@href' is incorrect. It doesn't work for me.
Try something simpler: '//a[@class="next ajax-page"]/@href'

Using scrapy recursively to scrape a phpBB forum

I'm trying to use scrapy to crawl a phpBB-based forum. My knowledge of scrapy is quite basic (but improving).
Extracting the contents of a forum thread's first page was more or less easy. My successful scraper was this:
import scrapy
from ptmya1.items import Ptmya1Item
class bastospider3(scrapy.Spider):
    name = "basto3"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]
    def parse(self, response):
        for sel in response.xpath('//div[2]/div'):
            item = Ptmya1Item()
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item
However, when I tried to crawl using the "next page" link, I failed after a lot of frustrating hours. I would like to show you my attempts in order to ask for advice. Note: I would prefer a solution based on the SgmlLinkExtractor variants, since they are more flexible and powerful, but I prioritize success after so many attempts.
First one, SgmlLinkExtractor with restricted path. 'Next page xpath' is
/html/body/div[1]/div[2]/form[1]/fieldset/a
Indeed, I tested with the shell that
response.xpath('//div[2]/form[1]/fieldset/a/@href')[1].extract()
returns a correct value for the "next page" link. However, I want to note that the cited xpath offers TWO links
>>> response.xpath('//div[2]/form[1]/fieldset/a/@href').extract()
[u'./search.php?sid=5aa2b92bec28a93c85956e83f2f62c08', u'./viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a&sid=5aa2b92bec28a93c85956e83f2f62c08&start=15']
thus, my failed scraper was
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from ptmya1.items import Ptmya1Item
class bastospider3(scrapy.Spider):
    name = "basto7"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//div[2]/form[1]/fieldset/a/@href')[1],), callback="parse_items", follow= True)
    )
    def parse_item(self, response):
        for sel in response.xpath('//div[2]/div'):
            item = Ptmya1Item()
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item
Second one, SgmlLinkExtractor with allow. More primitive and unsuccessful too
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from ptmya1.items import Ptmya1Item
class bastospider3(scrapy.Spider):
    name = "basto7"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a&start.',),), callback="parse_items", follow= True)
    )
    def parse_item(self, response):
        for sel in response.xpath('//div[2]/div'):
            item = Ptmya1Item()
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item
Finally, I returned to the damn paleolithic age, or to its first-tutorial equivalent. I tried to use the loop included at the end of the beginner's tutorial. Another failure:
import scrapy
import urlparse
from ptmya1.items import Ptmya1Item
class bastospider5(scrapy.Spider):
    name = "basto5"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]
    def parse_articles_follow_next_page(self, response):
        item = Ptmya1Item()
        item['cacho'] = response.xpath('//div[2]/form[1]/fieldset/a/@href').extract()[1][1:] + "http://portierramaryaire.com/foro"
        for sel in response.xpath('//div[2]/div'):
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item
        next_page = response.xpath('//fieldset/a[@class="right-box right"]')
        if next_page:
            cadenanext = response.xpath('//div[2]/form[1]/fieldset/a/@href').extract()[1][1:]
            url = urlparse.urljoin("http://portierramaryaire.com/foro",cadenanext)
            yield scrapy.Request(url, self.parse_articles_follow_next_page)
In all the cases, what I have obtained is a cryptic error message from which I cannot obtain a hint for the solution of my problem.
2015-10-08 21:24:46 [scrapy] DEBUG: Crawled (200) <GET http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a> (referer: None)
2015-10-08 21:24:46 [scrapy] ERROR: Spider error processing <GET http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2015-10-08 21:24:46 [scrapy] INFO: Closing spider (finished)
I really would appreciate any advice (or better, a working solution) for the problem. I'm utterly stuck on this and no matter how much I read, I am not able to find a solution :(
The cryptic error message occurs because you do not define a parse method. That is Scrapy's default callback when it wants to parse a response.
However, you only defined parse_articles_follow_next_page or parse_item, and neither of them is named parse.
And the failure is not on the next page but on the first one: Scrapy cannot parse the start URL, so your pagination code is never reached in any case. For the palaeolithic solution, rename your callback to parse and run it again.
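The mechanism can be sketched with a toy model. The classes below are hypothetical and written only for this illustration (they are not Scrapy's source); they mimic what the traceback shows: a base class whose parse raises NotImplementedError, and a default dispatch that always calls parse.

```python
class ToySpider:
    """Toy base class mimicking the behaviour seen in the traceback."""

    def parse(self, response):
        raise NotImplementedError

    def handle(self, response):
        # With no explicit callback, the response is handed to parse().
        return self.parse(response)


class WithoutParse(ToySpider):
    def parse_item(self, response):  # never called by the default dispatch
        return "parsed " + response


class WithParse(ToySpider):
    def parse(self, response):
        return "parsed " + response


try:
    WithoutParse().handle("page 1")
    outcome = "ok"
except NotImplementedError:
    outcome = "NotImplementedError"

print(outcome)
print(WithParse().handle("page 1"))
```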
If you are using a Rule, then you need a different spider class: CrawlSpider, which you can see in the tutorials. In that case do not override the parse method; keep a separate callback such as parse_items, as you do. That's because CrawlSpider uses parse itself to forward the responses to the callback method.
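One more thing that may be worth checking in the second attempt: the allow argument of a link extractor is treated as a regular expression, so characters such as ? and . in the pattern are not matched literally. A sketch with the standard re module, using the pattern and the "next page" href from the question; loose_pattern is only one hypothetical way to write it.

```python
import re

# The "next page" href shown in the question's shell output.
url = "./viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a&sid=5aa2b92bec28a93c85956e83f2f62c08&start=15"

# Pattern from the question: "?" makes the preceding "p" optional instead of
# matching a literal question mark, so the pattern does not match the URL.
raw_pattern = r"viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a&start."
raw_match = re.search(raw_pattern, url)

# Hypothetical alternative: escape the special characters and allow other
# parameters (such as sid) to appear before "start".
loose_pattern = r"viewtopic\.php\?f=3&t=3821.*&start=\d+"
loose_match = re.search(loose_pattern, url)

print(raw_match is None, loose_match is not None)
```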
Thanks to GHajba, the problem is solved. The solution was worked out in the comments.
However, the spider doesn't return the results in order. It starts on http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a
and it should walk through the "next page" URLs, which look like this: http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a&start=15
incrementing the 'start' variable by 15 posts each time.
Instead, the spider first returns the page for 'start=15', then 'start=30', then 'start=0', then 'start=15' again, then 'start=45'...
I am not sure whether I should create a new question or whether it would be better for future readers to continue the question here. What do you think?
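Scrapy issues requests concurrently, so responses can arrive in any order. One possible workaround, sketched here under the assumption that each scraped item records the URL of the page it came from (a hypothetical page_url field that is not in the original item), is to sort the items by the start parameter after the crawl:

```python
from urllib.parse import urlparse, parse_qs


def start_offset(url):
    """Read the 'start' query parameter; the first page has none, so use 0."""
    query = parse_qs(urlparse(url).query)
    return int(query.get("start", ["0"])[0])


base = "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
# Hypothetical items, in an arrival order similar to the one described above.
items = [
    {"page_url": base + "&start=15"},
    {"page_url": base + "&start=30"},
    {"page_url": base},
    {"page_url": base + "&start=45"},
]

ordered = sorted(items, key=lambda item: start_offset(item["page_url"]))
offsets = [start_offset(item["page_url"]) for item in ordered]
print(offsets)
```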
Since this question is five years old, many new approaches are out there.
By the way, see https://github.com/Dascienz/phpBB-forum-scraper
Python-based web scraper for phpBB forums. Project can be used as a
template for building your own custom Scrapy spiders or for one-off
crawls on designated forums. Please keep in mind that aggressive
crawls can produce significant strain on web servers, so please
throttle your request rates.
The phpBB.py spider scrapes the following information from forum
posts: Username, User Post Count, Post Date & Time, Post Text, Quoted Text.
If you need additional data scraped, you will have to create
additional spiders or edit the existing spider.
Edit phpBB.py and specify: allowed_domains, start_urls, username &
password, and forum_login=False or forum_login=True.
see also
import requests
forum = "the forum name"
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {'username': 'username', 'password': 'password', 'redirect':'index.php', 'sid':'', 'login':'Login'}
session = requests.Session()
r = session.post(forum + "ucp.php?mode=login", headers=headers, data=payload)
print(r.text)
Alternatively, instead of driving the website with requests, we can
use a browser automation library such as mechanize.
This way we don't have to manage the session ourselves and need only a few lines of code to craft each request.
An interesting example is on GitHub: https://github.com/winny-/sirsi/blob/317928f23847f4fe85e2428598fbe44c4dae2352/sirsi/sirsi.py#L74-L211

Scrapy Cookie Manipulation How to?

I have to crawl a website, so I use Scrapy to do it, but I need to pass a cookie to bypass the first page (which is a kind of login page where you choose your location).
I read on the web that you need to do this with a base Spider (not a CrawlSpider), but I need to use a CrawlSpider to do my crawling, so what do I need to do?
Start with a base Spider, then launch my CrawlSpider? But I don't know whether the cookie will be passed between them, or how to do it. How do I launch a spider from another spider?
How do I handle the cookie? I tried this:
def start_requests(self):
    yield Request(url='http://www.auchandrive.fr/drive/St-Quentin-985/', cookies={'auchanCook': '"985|"'})
But it's not working.
My answer should be here, but the guy is really evasive and I don't know what to do.
First, you need to enable cookies in the settings.py file:
COOKIES_ENABLED = True
Here is my test spider code for your reference. I tested it and it passed.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy import log
class Stackoverflow23370004Spider(CrawlSpider):
    name = 'auchandrive.fr'
    allowed_domains = ["auchandrive.fr"]
    target_url = "http://www.auchandrive.fr/drive/St-Quentin-985/"
    def start_requests(self):
        yield Request(self.target_url,cookies={'auchanCook': "985|"}, callback=self.parse_page)
    def parse_page(self, response):
        if 'St-Quentin-985' in response.url:
            self.log("Passed : %r" % response.url,log.DEBUG)
        else:
            self.log("Failed : %r" % response.url,log.DEBUG)
You can run this command to test it and watch the console output:
scrapy crawl auchandrive.fr
I noticed that in your code snippet, you were using cookies={'auchanCook': '"985|"'}, instead of cookies={'auchanCook': "985|"}.
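The difference between the two cookie values is easy to miss, so here is a small sketch in plain Python. It only shows that the value from the question carries literal double-quote characters inside the string; whether the site accepts one form or the other is something only a test against the site can tell.

```python
quoted_value = '"985|"'  # from the question: the double quotes are part of the string
plain_value = "985|"     # from the answer: no embedded quotes

print(len(quoted_value), len(plain_value))
print(quoted_value == plain_value)

# Stripping the embedded quotes recovers the plain value.
stripped = quoted_value.strip('"')
print(stripped == plain_value)
```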
This should get you started:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
class AuchanDriveSpider(CrawlSpider):
    name = 'auchandrive'
    allowed_domains = ["auchandrive.fr"]
    # pseudo-start_url
    begin_url = "http://www.auchandrive.fr/"
    # start URL used as shop selection
    select_shop_url = "http://www.auchandrive.fr/drive/St-Quentin-985/"
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//ul[@class="header-menu"]',))),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "vignette-content")]',)),
             callback='parse_product'),
    )
    def start_requests(self):
        yield Request(self.begin_url, callback=self.select_shop)
    def select_shop(self, response):
        return Request(url=self.select_shop_url, cookies={'auchanCook': "985|"})
    def parse_product(self, response):
        self.log("parse_product: %r" % response.url)
Pagination might be tricky.