Adding XHR links to scraped category hrefs: missing scheme error - python-2.7

I have built a spider that gets data from one category. The approach it follows is: the category page is specified in start_urls, and start_requests is defined for pagination, iterating over the link provided by the XHR request. Since I wanted to get all the categories at once, I have written the code below. My logic was to first get all the category links, append to each of them the XHR query string that is the same for every category (?from=24&ajax=true&search_query=&orderby=popular&orderway=asc&latestfilter=&source=menu), pass the appended URLs to start_requests, and iterate over them for pagination and item parsing. But I am not able to run the spider because it throws a "missing scheme" error, since in start_requests I have not provided the http:// prefix. I am stuck on how to solve this issue, please help.
import scrapy


class JabcatSpider(scrapy.Spider):
    name = "jabcat"
    allowed_domains = ["trendin.com"]
    start_urls = [
        'http://www.trendin.com',
    ]
    max_pages = 400

    def parse(self, response):
        urls = response.xpath('//div[@class="men"]//@href').extract()
        for url in urls:
            urljoin = (url + "/" + "?from=24&ajax=true&search_query=&orderby=popular&orderway=asc&latestfilter=&source=menu")
            # yield scrapy.Request(urljoin, callback=self.start_requests)
            print urljoin

    def start_requests(self):
        for i in range(self.max_pages):
            yield scrapy.Request('?from=%d&ajax=true&search_query=&orderby=popular&orderway=asc&latestfilter=&source=menu' % i, callback=self.parse)

    # note: this second parse definition shadows the one above
    def parse(self, response):
        for href in response.xpath('//*[@id="product_rows"]/div/div/div/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        for sel in response.xpath('//*[@id="catalog-product"]/section[2]'):
            item = Jabongo()
            item['title'] = response.xpath('//*[@id="product-details-wrapper"]/div[1]/div[2]/div/div[1]/h1/span[2]/text()').extract()
            # item['price'] = response.xpath('//*[@id="pdp-price-info"]/span[2]/text()').extract()
            # item['image'] = response.xpath('//*[@class="content"]/h1/span[2]/text()').extract()
            # item['color'] = sel.xpath('//ul/li/label[.="Color"]/following-sibling::span/text()').extract()
            # return item
            # pattern = response.xpath('//*[@class="content"]/h1/span[2]/text()').extract()
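For context, the "missing scheme" error usually disappears once every Request URL is absolute. Below is a minimal sketch of how the scraped category hrefs could be joined against the page URL with response.urljoin before the XHR query string is appended; the spider name, the parse_category callback, and the overall flow are assumptions, not the poster's final code:

import scrapy


class JabcatAjaxSketch(scrapy.Spider):
    # hypothetical sketch, not the original spider
    name = "jabcat_sketch"
    allowed_domains = ["trendin.com"]
    start_urls = ['http://www.trendin.com']
    max_pages = 400

    def parse(self, response):
        for href in response.xpath('//div[@class="men"]//@href').extract():
            # response.urljoin turns a relative href into an absolute URL
            # (scheme included), which is what scrapy.Request requires
            base = response.urljoin(href)
            for i in range(self.max_pages):
                url = base + "/?from=%d&ajax=true&search_query=&orderby=popular&orderway=asc&latestfilter=&source=menu" % i
                yield scrapy.Request(url, callback=self.parse_category)

    def parse_category(self, response):
        # pagination and item parsing would go here
        pass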

Related

How should I be formatting my yield requests?

My scrapy spider is very confused, or I am, but one of us is not working as intended. My spider pulls start URLs from a file and is supposed to: start on an Amazon search page, crawl the page and grab the URL of each search result, follow the link to the item's page, crawl the item's page for information on the item, and once all items on the first page have been crawled, follow pagination up to page X, rinse and repeat.
I am using ScraperAPI and Scrapy-user-agent to randomize my middlewares. I have formatted my start_requests with a priority based on their index in the file, so they should be crawled in order. I have checked and ensured that I AM receiving a successful 200 HTML response with the actual HTML from the Amazon page. Here is the code for the spider:
import re

import requests
import scrapy


class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    page_number = 2
    current_keyword = 0
    keyword_list = []

    payload = {'api_key': 'mykey', 'url': 'https://httpbin.org/ip'}
    r = requests.get('http://api.scraperapi.com', params=payload)
    print(r.text)

    #/////////////////////////////////////////////////////////////////////
    def start_requests(self):
        with open("keywords.txt") as f:
            for index, line in enumerate(f):
                try:
                    keyword = line.strip()
                    AmazonSpiderSpider.keyword_list.append(keyword)
                    formatted_keyword = keyword.replace(' ', '+')
                    url = "http://api.scraperapi.com/?api_key=mykey&url=https://www.amazon.com/s?k=" + formatted_keyword + "&ref=nb_sb_noss_2"
                    yield scrapy.Request(url, meta={'priority': index})
                except:
                    continue

    #/////////////////////////////////////////////////////////////////////
    def parse(self, response):
        print("========== starting parse ===========")
        for next_page in response.css("h2.a-size-mini a").xpath("@href").extract():
            if next_page is not None:
                if "https://www.amazon.com" not in next_page:
                    next_page = "https://www.amazon.com" + next_page
                yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + next_page, callback=self.parse_dir_contents)

        second_page = response.css('li.a-last a').xpath("@href").extract_first()
        if second_page is not None and AmazonSpiderSpider.page_number < 3:
            AmazonSpiderSpider.page_number += 1
            yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + second_page, callback=self.parse_pagination)
        else:
            AmazonSpiderSpider.current_keyword = AmazonSpiderSpider.current_keyword + 1

    #/////////////////////////////////////////////////////////////////////
    def parse_pagination(self, response):
        print("========== starting pagination ===========")
        for next_page in response.css("h2.a-size-mini a").xpath("@href").extract():
            if next_page is not None:
                if "https://www.amazon.com" not in next_page:
                    next_page = "https://www.amazon.com" + next_page
                yield scrapy.Request(
                    'http://api.scraperapi.com/?api_key=mykey&url=' + next_page,
                    callback=self.parse_dir_contents)

        second_page = response.css('li.a-last a').xpath("@href").extract_first()
        if second_page is not None and AmazonSpiderSpider.page_number < 3:
            AmazonSpiderSpider.page_number += 1
            yield scrapy.Request(
                'http://api.scraperapi.com/?api_key=mykey&url=' + second_page,
                callback=self.parse_pagination)
        else:
            AmazonSpiderSpider.current_keyword = AmazonSpiderSpider.current_keyword + 1

    #/////////////////////////////////////////////////////////////////////
    def parse_dir_contents(self, response):
        items = ScrapeAmazonItem()
        print("============= parsing page ==============")

        temp = response.css('#productTitle::text').extract()
        product_name = ''.join(temp)
        product_name = product_name.replace('\n', '')
        product_name = product_name.strip()

        temp = response.css('#priceblock_ourprice::text').extract()
        product_price = ''.join(temp)
        product_price = product_price.replace('\n', '')
        product_price = product_price.strip()

        temp = response.css('#SalesRank::text').extract()
        product_score = ''.join(temp)
        product_score = product_score.strip()
        product_score = re.sub(r'\D', '', product_score)

        product_ASIN = response.css('li:nth-child(2) .a-text-bold+ span').css('::text').extract()

        keyword = AmazonSpiderSpider.keyword_list[AmazonSpiderSpider.current_keyword]
        items['product_keyword'] = keyword
        items['product_ASIN'] = product_ASIN
        items['product_name'] = product_name
        items['product_price'] = product_price
        items['product_score'] = product_score
        yield items
For the FIRST start URL, it will crawl three or four items and then jump to the SECOND start URL, skipping the remaining items and pagination pages. For the second URL, it will crawl three or four items and again skip to the THIRD start URL. It continues this way, grabbing three or four items and then skipping to the next URL, until it reaches the final start URL, where it gathers all the information completely. Sometimes the spider COMPLETELY SKIPS the first or second starting URL. This happens infrequently, but I have no idea what could cause it.
My code for following result item URLs works fine, but I never get the print statement for "starting pagination", so it is not correctly following pages. Also, there is something odd with middlewares: it begins parsing before it has assigned a middleware.
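One pattern worth comparing against is carrying per-keyword state in request.meta instead of class-level counters, which are shared across all start URLs crawled concurrently. The sketch below is an assumption about one possible restructuring, not the thread's answer; the spider name, meta keys, and the yielded dict fields are made up, while the ScraperAPI wrapping and "mykey" placeholder are kept from the post:

import scrapy


class MetaStateSketch(scrapy.Spider):
    # hypothetical sketch: per-request state instead of class-level counters
    name = 'amazon_meta_sketch'

    def start_requests(self):
        with open("keywords.txt") as f:
            for index, line in enumerate(f):
                keyword = line.strip()
                url = ("http://api.scraperapi.com/?api_key=mykey&url="
                       "https://www.amazon.com/s?k=" + keyword.replace(' ', '+'))
                # each request remembers which keyword and page it belongs to
                yield scrapy.Request(url, callback=self.parse,
                                     meta={'keyword': keyword, 'page': 1})

    def parse(self, response):
        keyword = response.meta['keyword']
        page = response.meta['page']
        for href in response.css("h2.a-size-mini a").xpath("@href").extract():
            if "https://www.amazon.com" not in href:
                href = "https://www.amazon.com" + href
            yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + href,
                                 callback=self.parse_dir_contents,
                                 meta={'keyword': keyword})

        next_href = response.css('li.a-last a').xpath("@href").extract_first()
        if next_href and page < 3:
            if "https://www.amazon.com" not in next_href:
                next_href = "https://www.amazon.com" + next_href
            # pagination reuses parse, carrying the same keyword forward
            yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + next_href,
                                 callback=self.parse,
                                 meta={'keyword': keyword, 'page': page + 1})

    def parse_dir_contents(self, response):
        # the keyword travels with the request instead of a class attribute
        yield {'product_keyword': response.meta['keyword'],
               'product_name': ''.join(response.css('#productTitle::text').extract()).strip()}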

Are scrapy definitions (self, response) limited, or do they require a special function in order to parse multiple URLs in a loop?

I am parsing two lists of URLs, with the second being populated from the first. I am able to get all the needed URLs, but I am having trouble pulling data from the final list of URLs; I only get data from one URL.
I have tried using scrapy.Request, response.follow, and a while statement, but I only get a response from one URL. I am new to scrapy/python and not sure how to fix this issue.
import scrapy


class DBSpider(scrapy.Spider):
    name = "Scrape"
    allowed_domains = ["192.168.3.137"]
    start_urls = [
        'http://192.168.3.137/api/?controller_id=' + str(x)
        for x in range(0, 8)
    ]

    def parse(self, response):
        for sites in response.xpath('//*[@id="siteslist"]//li/a/@href').extract():
            yield response.follow(sites, self.parse_sites)

    def parse_sites(self, response):
        for device_pages in response.xpath('//*[@id="list_devices"]/a/@href').extract():
            yield scrapy.Request(response.urljoin(device_pages),
                                 self.parse_macs)

    def parse_macs(self, response):
        print response.url  # only gets one response url
parse_macs only prints one response URL; it should print one for each URL from the device_pages loop in parse_sites.
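One thing worth ruling out is Scrapy's duplicate filter: if several device pages resolve to the same URL, only the first request reaches parse_macs and the rest are silently dropped. This is an assumption about the cause, not a confirmed diagnosis; a minimal sketch of parse_sites from the spider above with the filter bypassed so the difference becomes visible:

    def parse_sites(self, response):
        for device_pages in response.xpath('//*[@id="list_devices"]/a/@href').extract():
            # dont_filter=True bypasses the duplicate filter so every matching
            # href triggers parse_macs, even if several resolve to the same URL
            yield scrapy.Request(response.urljoin(device_pages),
                                 callback=self.parse_macs,
                                 dont_filter=True)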

Scrapy crawler not recursively crawling next page

I am trying to build this crawler to get housing data from craigslist, but the crawler stops after fetching the first page and does not go to the next page.
Here is the code. It works for the first page, but for the love of god I don't understand why it does not get to the next page. Any insight is really appreciated. I followed this part from the scrapy tutorial.
import scrapy
import re
from scrapy.linkextractors import LinkExtractor


class QuotesSpider(scrapy.Spider):
    name = "craigslistmm"
    start_urls = [
        "https://vancouver.craigslist.ca/search/hhh"
    ]

    def parse_second(self, response):
        # need all the info in a dict
        meta_dict = response.meta
        for q in response.css("section.page-container"):
            meta_dict["post_details"] = {
                "location": {
                    "longitude": q.css("div.mapAndAttrs div.mapbox div.viewposting::attr(data-longitude)").extract(),
                    "latitude": q.css("div.mapAndAttrs div.mapbox div.viewposting::attr(data-latitude)").extract()
                },
                "detailed_info": ' '.join(q.css('section#postingbody::text').extract()).strip()
            }
        return meta_dict

    def parse(self, response):
        pattern = re.compile("\/([a-z]+)\/([a-z]+)\/.+")
        for q in response.css("li.result-row"):
            post_urls = q.css("p.result-info a::attr(href)").extract_first()
            mm = re.match(pattern, post_urls)
            neighborhood = q.css("p.result-info span.result-meta span.result-hood::text").extract_first()
            next_url = "https://vancouver.craigslist.ca/" + post_urls
            request = scrapy.Request(next_url, callback=self.parse_second)
            # next_page = response.xpath('.//a[@class="button next"]/@href').extract_first()
            # follow_url = "https://vancouver.craigslist.ca/" + next_page
            # request1 = scrapy.Request(follow_url, callback=self.parse)
            # yield response.follow(next_page, callback=self.parse)
            request.meta['id'] = q.css("li.result-row::attr(data-pid)").extract_first()
            request.meta['pricevaluation'] = q.css("p.result-info span.result-meta span.result-price::text").extract_first()
            request.meta["information"] = q.css("p.result-info span.result-meta span.housing::text").extract_first()
            request.meta["neighborhood"] = q.css("p.result-info span.result-meta span.result-hood::text").extract_first()
            request.meta["area"] = mm.group(1)
            request.meta["adtype"] = mm.group(2)
            yield request
            # yield scrapy.Request(follow_url, callback=self.parse)

        next_page = LinkExtractor(allow="s=\d+").extract_links(response)[0]
        # = "https://vancouver.craigslist.ca/" + next_page
        yield response.follow(next_page.url, callback=self.parse)
The problem seems to be with the next_page extraction using LinkExtractor. If you look in the log, you'll see duplicate requests being filtered. There are more links on the page that satisfy your extraction rule, and they may not be extracted in any particular order (or not in the order you wish).
I think a better approach is to extract exactly the information you want; try it with this:
next_page = response.xpath('//span[@class="buttons"]//a[contains(., "next")]/@href').extract_first()
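A minimal sketch of how that selector could replace the LinkExtractor at the end of the spider's parse method above, assuming the per-posting requests stay as they are:

    def parse(self, response):
        # ... yield the per-posting requests exactly as in the original parse ...

        # follow pagination by targeting the "next" button explicitly,
        # instead of taking the first LinkExtractor match
        next_page = response.xpath(
            '//span[@class="buttons"]//a[contains(., "next")]/@href'
        ).extract_first()
        if next_page is not None:
            # response.follow resolves the relative href against the current URL
            yield response.follow(next_page, callback=self.parse)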

Scrapy: Crawls pages but scrapes 0 items

I am trying to scrape baseball-reference.com. In the Scrapy bot I created, I start from the first page and navigate to different links, and from there to a third link. Please find the code below:
class VisitorBattingSpider(InitSpider):
    name = 'VisitorBatting'
    year = str(datetime.datetime.today().year)
    allowed_domains = ["baseball-reference.com"]
    start = 'http://www.baseball-reference.com/boxes/' + year + '.shtml'
    start_urls = [start]
    # rules = [Rule(LinkExtractor(allow=['/play-index/st.cgi?date=\d+-\d+-\d+']), callback='parse_item',)]

    def __init__(self):
        BaseSpider.__init__(self)
        # use any browser you wish
        self.browser = webdriver.Firefox()

    def __del__(self):
        self.browser.close()

    def parse(self, response):
        self.browser.get(response.url)
        # let JavaScript load
        time.sleep(15)
        page = Selector(text=self.browser.page_source)
        # page = Selector(response)
        sites = page.xpath('//*[@id="2016"]/tbody/tr/td/table/tbody/tr/td/a/@href')
        for site in sites:
            tree = site.extract()
            yield Request(url='http://www.baseball-reference.com' + tree, callback=self.parse_new, dont_filter=True)
        self.browser.close()

    def parse_new(self, response):
        hxs = Selector(response)
        loads = hxs.xpath('/html/body/pre/a/@href')
        for load in loads:
            branch = load.extract()
            if 'boxes' in branch:
                yield Request(url='http://www.baseball-reference.com' + branch, callback=self.parse_final, dont_filter=True)

    def parse_final(self, response):
        self.browser.get(response.url)
        fxs = Selector(text=self.browser.page_source)
        vi = fxs.xpath('html/body/div/div[3]/div[1]/div[1]/h3/text()').extract()
        vis = ''.join(vi)
        if "." in vis:
            visitor = vis.replace(".", "")
        else:
            visitor = vis
        visitor_id = visitor.replace(" ", "")
        print visitor_id
        UR = response.url
        URL = ''.join(UR)
        dtt = URL[-15:]
        dt = dtt[:8]
        day = datetime.datetime(int(dt[:4]), int(dt[5:6]), int(dt[-2:]), 1, 1, 1).weekday()
        path = '//*[@id="' + visitor_id + 'batting"]/tfoot/tr'
        webs = fxs.xpath(path)
        items = []
        for web in webs:
            item = VisitorbattingItem()
            item['ID'] = response.url
            item['AWAY_TEAM'] = visitor_id
            item['GAME_DT'] = dt
            item['GAME_DY'] = day
            item['AWAY_GAME'] = 1
            item['AWAY_SCORE_CT'] = web.xpath("td[3]/text()").extract()
            item['MINUTES_GAME_CT'] = fxs.xpath('//*[@id="gametime"]/text()').extract()
            item['AWAY_AB'] = web.xpath("td[2]/span/text()").extract()
            item['AWAY_HITS'] = web.xpath("td[4]/text()").extract()
            item['AWAY_DO'] = fxs.xpath('//*[@id="2Bvisitor"]/text()').extract()
            item['AWAY_TR'] = fxs.xpath('//*[@id="3Bvisitor"]/text()').extract()
            item['AWAY_RBI'] = web.xpath("td[5]/text()").extract()
            item['AWAY_HBP'] = fxs.xpath('//*[@id="HBPvisitor"]/text()').extract()
            item['AWAY_SB'] = fxs.xpath('//*[@id="SBvisitor"]/text()').extract()
            item['AWAY_LOB'] = fxs.xpath('//*[@id="teamlobvisitor"]/text()').extract()
            item['AWAY_PO'] = web.xpath("td[5]/text()").extract()
            item['AWAY_ASS'] = web.xpath("td[5]/text()").extract()
            item['AWAY_ERR'] = fxs.xpath('//*[@id="linescore"]/strong[3]/text()').extract()
            item['AWAY_PB'] = fxs.xpath('//*[@id="PBvisitor"]/text()').extract()
            item['AWAY_DP'] = fxs.xpath('//*[@id="DPvisitor"]/text()').extract()
            item['AWAY_TP'] = fxs.xpath('//*[@id="TPvisitor"]/text()').extract()
            item['AWAY_First_Innings'] = fxs.xpath('//*[@id="linescore"]/text()[3]').extract()
            item['AWAY_IBB'] = fxs.xpath('//*[@id="IBBvisitor"]/text()').extract()
            item['AWAY_BB'] = web.xpath("td[6]/text()").extract()
            item['AWAY_SO'] = web.xpath("td[7]/text()").extract()
            items.append(item)
        self.browser.close()
        return items
The problem is that when I execute the script, the message I get on my CMD prompt says "crawled pages, scraped 0 items". I don't understand why the items are not being scraped. Any help would be appreciated.
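A quick way to narrow this down is to confirm that the tfoot XPath built in parse_final actually matches something in the Selenium-rendered source; if it matches nothing, the item loop never runs and zero items is the expected outcome. A minimal, hypothetical debugging sketch (the visitor_id value and the URL are assumptions to be substituted with real ones):

from selenium import webdriver
from scrapy.selector import Selector

# render one box-score page with the webdriver and check the selector directly
browser = webdriver.Firefox()
browser.get('http://www.baseball-reference.com/boxes/')  # substitute a real box-score URL here
sel = Selector(text=browser.page_source)
rows = sel.xpath('//*[@id="NewYorkYankeesbatting"]/tfoot/tr')  # the visitor_id part is an assumption
print len(rows)  # 0 here means the loop in parse_final never runs, hence 0 items scraped
browser.close()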

Scrapy get request url in parse

How can I get the request URL in Scrapy's parse() function? I have a lot of URLs in start_urls, and some of them redirect my spider to the homepage, resulting in an empty item. So I need something like item['start_url'] = request.url to store these URLs. I'm using the BaseSpider.
The 'response' variable that's passed to parse() has the info you want. You shouldn't need to override anything.
e.g. (EDITED):
def parse(self, response):
    print "URL: " + response.request.url
The request object is accessible from the response object, therefore you can do the following:
def parse(self, response):
    item['start_url'] = response.request.url
Rather than storing the requested URLs somewhere yourself (and note that Scrapy does not process URLs in the same sequence as they are provided in start_urls), you can use the following:
response.request.meta['redirect_urls']
will give you the list of redirects that happened, like ['http://requested_url', 'https://redirected_url', 'https://final_redirected_url'].
To access the first URL from the above list, you can use
response.request.meta['redirect_urls'][0]
For more, see doc.scrapy.org, where it is mentioned:
RedirectMiddleware
This middleware handles redirection of requests based on response status.
The urls which the request goes through (while being redirected) can be found in the redirect_urls Request.meta key.
Hope this helps you.
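A minimal sketch of using that meta key with a fallback, since redirect_urls is only set when RedirectMiddleware actually followed a redirect; the plain-dict item here is an assumption in place of whatever item class is in use:

def parse(self, response):
    # redirect_urls is absent when no redirect happened, so fall back
    # to the final response URL in that case
    redirect_urls = response.request.meta.get('redirect_urls', [])
    item = {}
    item['start_url'] = redirect_urls[0] if redirect_urls else response.url
    return item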
You need to override BaseSpider's make_requests_from_url(url) function to assign the start_url to the item, and then use the Request.meta special keys to pass that item to the parse function:
from scrapy.http import Request

# override method
def make_requests_from_url(self, url):
    item = MyItem()
    # assign url
    item['start_url'] = url
    request = Request(url, dont_filter=True)
    # set meta['item'] to use the item in the next callback
    request.meta['item'] = item
    return request

def parse(self, response):
    # access and do something with the item in parse
    item = response.meta['item']
    item['other_url'] = response.url
    return item
Hope that helps.
Python 3.5
Scrapy 1.5.0
from scrapy.http import Request

# override method
def start_requests(self):
    for url in self.start_urls:
        item = {'start_url': url}
        request = Request(url, dont_filter=True)
        # set meta['item'] to use the item in the next callback
        request.meta['item'] = item
        yield request

# use meta variable
def parse(self, response):
    url = response.meta['item']['start_url']