Scrapy: Return Each Item in a New CSV Row Using Item Loader - python-2.7

I am attempting to produce a csv output of select items contained in a particular class (title, link, price) that parses out each item in its own column, and each instance in its own row using itemloaders and the items module.
I can produce the output using a self-contained spider (without use of items module), however, I'm trying to learn the proper way of detailing the items in the items module, so that I can eventually scale up projects using the proper structure. (I will detail this code as 'Working Row Output Spider Code' below)
I have also attempted to incorporate solutions determined or discussed in related posts; in particular:
Writing Itemloader By Item to XML or CSV Using Scrapy posted by Sam
Scrapy Return Multiple Items posted by Zana Daniel
by using a for loop as he notes at the bottom of the comments section. However, I can get scrapy to accept the for loop, it just doesn't result in any change, that is the items are still grouped in single fields rather than being output into independent rows.
Below is a detail of the code contained in two project attempts --'Working Row Output Spider Code' that does not incorporate items module and items loader, and 'Non Working Row Output Spider Code'-- and the corresponding output of each.
Working Row Output Spider Code: btobasics.py
import scrapy
import urlparse
class BasicSpider(scrapy.Spider):
name = 'basic'
allowed_domains = ['http://http://books.toscrape.com/']
start_urls = ['http://books.toscrape.com//']
def parse(self, response):
titles = response.xpath('//*[#class="product_pod"]/h3//text()').extract()
links = response.xpath('//*[#class="product_pod"]/h3/a/#href').extract()
prices = response.xpath('//*[#class="product_pod"]/div[2]/p[1]/text()').extract()
for item in zip(titles, links, prices):
# create a dictionary to store the scraped info
scraped_info = {
'title': item[0],
'link': item[1],
'price': item[2],
}
# yield or give the scraped info to scrapy
yield scraped_info
Run Command to produce CSV: $ scrapy crawl basic -o output.csv
Working Row Output WITHOUT STRUCTURED ITEM LOADERS
Non Working Row Output Spider Code: btobasictwo.py
import datetime
import urlparse
import scrapy
from btobasictwo.items import BtobasictwoItem
from scrapy.loader.processors import MapCompose
from scrapy.loader import ItemLoader
class BasicSpider(scrapy.Spider):
name = 'basic'
allowed_domains = ['http://http://books.toscrape.com/']
start_urls = ['http://books.toscrape.com//']
def parse(self, response):
# Create the loader using the response
links = response.xpath('//*[#class="product_pod"]')
for link in links:
l = ItemLoader(item=BtobasictwoItem(), response=response)
# Load fields using XPath expressions
l.add_xpath('title', '//*[#class="product_pod"]/h3//text()',
MapCompose(unicode.strip))
l.add_xpath('link', '//*[#class="product_pod"]/h3/a/#href',
MapCompose(lambda i: urlparse.urljoin(response.url, i)))
l.add_xpath('price', '//*[#class="product_pod"]/div[2]/p[1]/text()',
MapCompose(unicode.strip))
# Log fields
l.add_value('url', response.url)
l.add_value('date', datetime.datetime.now())
return l.load_item()
Non Working Row Output Items Code: btobasictwo.items.py
from scrapy.item import Item, Field
class BtobasictwoItem(Item):
# Primary fields
title = Field()
link = Field()
price = Field()
# Log fields
url = Field()
date = Field()
Run Command to produce CSV: $ scrapy crawl basic -o output.csv
Non Working Row Code Output WITH STRUCTURED ITEM LOADERS
As you can see, when attempting to incorporate the items module, itemloaders and a for loop to structure the data, it does not seperate the instances by row, but rather puts all instances of a particular item (title, link, price) in 3 fields.
I would greatly appreciate any help on this, and apologize for the lengthy post. I just wanted to document as much as possible so that anyone wanting to assist could run the code themselves, and/or fully appreciate the problem from my documentation. (please leave a comment instructing on length of post if you feel it is not appropriate to be this lengthly).
Thanks very much

You need to tell your ItemLoader to use another selector:
def parse(self, response):
# Create the loader using the response
links = response.xpath('//*[#class="product_pod"]')
for link in links:
l = ItemLoader(item=BtobasictwoItem(), selector=link)
# Load fields using XPath expressions
l.add_xpath('title', './/h3//text()',
MapCompose(unicode.strip))
l.add_xpath('link', './/h3/a/#href',
MapCompose(lambda i: urlparse.urljoin(response.url, i)))
l.add_xpath('price', './/div[2]/p[1]/text()',
MapCompose(unicode.strip))
# Log fields
l.add_value('url', response.url)
l.add_value('date', datetime.datetime.now())
yield l.load_item()

Related

Python scrapy - yield initial items and items from callback to csv

So I've managed to write a spider that extracts the download links of "Videos" and "English Transcripts" from this site . Looking at the cmd window i can see that all the correct information has been scraped.
The issue I am having is that the output csv file only contains the "Video" links and not the "English Transcripts" links (even though you can see that it's been scraped in the cmd window).
I've tried a few suggestions from other posts but none of them seem to work.
The following picture is how I'd like the output to look like:
CSV Output Picture
this is my current spider code:
import scrapy
class SuhbaSpider(scrapy.Spider):
name = "suhba2"
start_urls = ["http://saltanat.org/videos.php?topic=SheikhBahauddin&gopage={numb}".format(numb=numb)
for numb in range(1,3)]
def parse(self, response):
yield{
"video" : response.xpath("//span[#class='download make-cursor']/a/#href").extract(),
}
fullvideoid = response.xpath("//span[#class='media-info make-cursor']/#onclick").extract()
for videoid in fullvideoid:
url = ("http://saltanat.org/ajax_transcription.php?vid=" + videoid[21:-2])
yield scrapy.Request(url, callback=self.parse_transcript)
def parse_transcript(self, response):
yield{
"transcript" : response.xpath("//a[contains(#href,'english')]/#href").extract(),
}
You are yielding two different kinds of items - one containing just video attribute and one containing just transcript attribute. You have to yield one kind of item composed of both attributes. For that, you have to create item in parse and pass it to second level request using meta. Then, in the parse_transcript, you take it from meta, fill additional data and finally yield the item. The general pattern is described in Scrapy documentation.
The second thing is that you extract all videos at once using extract() method. This yields a list where it's hard afterwards to link each individual element with corresponding transcript. Better approach is to loop over each individual video element in the HTML and yield item for each video.
Applied to your example:
import scrapy
class SuhbaSpider(scrapy.Spider):
name = "suhba2"
start_urls = ["http://saltanat.org/videos.php?topic=SheikhBahauddin&gopage={numb}".format(numb=numb) for numb in range(1,3)]
def parse(self, response):
for video in response.xpath("//tr[#class='video-doclet-row']"):
item = dict()
item["video"] = video.xpath(".//span[#class='download make-cursor']/a/#href").extract_first()
videoid = video.xpath(".//span[#class='media-info make-cursor']/#onclick").extract_first()
url = "http://saltanat.org/ajax_transcription.php?vid=" + videoid[21:-2]
request = scrapy.Request(url, callback=self.parse_transcript)
request.meta['item'] = item
yield request
def parse_transcript(self, response):
item = response.meta['item']
item["transcript"] = response.xpath("//a[contains(#href,'english')]/#href").extract_first()
yield item

unable to scrape a text file through a given link and store it in an output file

I'm trying to collect the weather data for year 2000 from this site.
My code for the spider is:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class weather(CrawlSpider):
name = 'data'
start_urls = [
'https://www1.ncdc.noaa.gov/pub/data/uscrn/products/daily01/'
]
custom_settings = {
'DEPTH_LIMIT': '2',
}
rules = (
Rule(LinkExtractor(restrict_xpaths=
('//table/tr[2]/td[1]/a',)),callback='parse_item',follow=True),
)
def parse_item(self, response):
for b in response.xpath('//table')
yield scrapy.request('/tr[4]/td[2]/a/#href').extract()
yield scrapy.request('/tr[5]/td[2]/a/#href').extract()
The paths with 'yield' are the links to two text file and i want to scrape the data from these text files and store it separately in two different files but I don't know how to continue.
I don't usually use CrawlSpider so I'm unfamiliar with it, but it seems like you should be able to create another xpath (preferably something more specific than "/tr[4]/td[2]/a/#href") and supply a callback function.
However, in a "typical" scrapy project using Spider instead of CrawlSpider, you would simply yield a Request with another callback function to handle extracting and storing to the database. For example:
def parse_item(self, response):
for b in response.xpath('//table')
url = b.xpath('/tr[4]/td[2]/a/#href').extract_first()
yield scrapy.Request(url=url, callback=extract_and_store)
url = b.xpath('/tr[5]/td[2]/a/#href').extract_first()
yield scrapy.Request(url=url, callback=extract_and_store)
def extract_and_store(self, response):
"""
Scrape data and store it separately
"""

How to follow next pages in Scrapy Crawler to scrape content

I am able to scrape all the stories from the first page,my problem is how to move to the next page and continue scraping stories and name,kindly check my code below
# -*- coding: utf-8 -*-
import scrapy
from cancerstories.items import CancerstoriesItem
class MyItem(scrapy.Item):
name = scrapy.Field()
story = scrapy.Field()
class MySpider(scrapy.Spider):
name = 'cancerstories'
allowed_domains = ['thebreastcancersite.greatergood.com']
start_urls = ['http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/']
def parse(self, response):
rows = response.xpath('//a[contains(#href,"story")]')
#loop over all links to stories
for row in rows:
myItem = MyItem() # Create a new item
myItem['name'] = row.xpath('./text()').extract() # assign name from link
story_url = response.urljoin(row.xpath('./#href').extract()[0]) # extract url from link
request = scrapy.Request(url = story_url, callback = self.parse_detail) # create request for detail page with story
request.meta['myItem'] = myItem # pass the item with the request
yield request
def parse_detail(self, response):
myItem = response.meta['myItem'] # extract the item (with the name) from the response
#myItem['name']=response.xpath('//h1[#class="headline"]/text()').extract()
text_raw = response.xpath('//div[#class="photoStoryBox"]/div/p/text()').extract() # extract the story (text)
myItem['story'] = ' '.join(map(unicode.strip, text_raw)) # clean up the text and assign to item
yield myItem # return the item
You could change your scrapy.Spider for a CrawlSpider, and use Rule and LinkExtractor to follow the link to the next page.
For this approach you have to include the code below:
...
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
...
rules = (
Rule(LinkExtractor(allow='\.\./stories;jsessionid=[0-9A-Z]+?page=[0-9]+')),
)
...
class MySpider(CrawlSpider):
...
This way, for each page you visit the spider will create a request for the next page (if present), follow it when finishes the execution for the parse method, and repeat the process again.
EDIT:
The rule I wrote is just to follow the next page link not to extract the stories, if your first approach works it's not necessary to change it.
Also, regarding the rule in your comment, SgmlLinkExtractor is deprecated so I recommend you to use the default link extractor, and the rule itself is not well defined.
When the parameter attrs in the extractor is not defined, it searchs links looking for the href tags in the body, which in this case looks like ../story/mother-of-4435 and not /clickToGive/bcs/story/mother-of-4435. That's the reason it doesn't find any link to follow.
you can follow next pages manually if you would use scrapy.spider class,example:
next_page = response.css('a.pageLink ::attr(href)').extract_first()
if next_page:
absolute_next_page_url = response.urljoin(next_page)
yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)
Do not forget to rename your parse method to parse_start_url if you want to use CralwSpider class

Issues on following links in scrapy

I want to crawl a blog which has several categories of websites . Starting navigating the page from the first category, my goal is to collect every webpage by following the categories . I have collected the websites from the 1st category but the spider stops there , can't reach the 2nd category .
An example draft :
my code :
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from final.items import DmozItem
class my_spider(CrawlSpider):
name = 'heart'
allowed_domains = ['greek-sites.gr']
start_urls = ['http://www.greek-sites.gr/categories/istoselides-athlitismos']
rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse', follow=True),)
def parse(self, response):
self.logger.info('Hi, this is an item page! %s', response.url)
categories = response.xpath('//a[contains(#href, "categories")]/text()').extract()
for category in categories:
item = DmozItem()
item['title'] = response.xpath('//a[contains(text(),"gr")]/text()').extract()
item['category'] = response.xpath('//div/strong/text()').extract()
return item
The problem is simple: the callback has to be different than parse, so I suggest you name your method parse_site for example and then you are ready to continue your scraping.
If you make the change below it will work:
rules = (Rule(LinkExtractor(allow=(r'.*categories/.*', )), callback='parse_site', follow=True),)
def parse_site(self, response):
The reason for this is described in the docs:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

Trying to extract from the deep node with scrapy, results are bad

As a beginner I'm having a hard time, so I'm here to ask for help.
I'm trying to extract prices from the html page, which are nested deeply:
second price location:
from scrapy.spider import Spider
from scrapy.selector import Selector
from mymarket.items import MymarketItem
class MySpider(Spider):
name = "mymarket"
allowed_domains = ["url"]
start_urls = [
"http://url"
]
def parse(self, response):
sel = Selector(response)
titles = sel.xpath('//table[#class="tab_product_list"]//tr')
items = []
for t in titles:
item = MymarketItem()
item["price"] = t.xpath('//tr//span[2]/text()').extract()
items.append(item)
return items
I'm trying to export scraped prices to csv. they do export but are being populated like this:
And I want them to be sorted like this in .csv:
etc.
Can anybody point out where is the faulty part of the xpath or how I can make prices be sorted "properly" ?
It's difficult to say what's wrong with the path. Install firepath extension for Firefox to test your xpath queries. One note for now:
titles = sel.xpath('//table[#class="tab_product_list"]//tr')
In your screenshot you have nested tables, so //tr will give trs from nested tables too.
def parse(self, response):
sel = Selector(response)
titles = sel.xpath('//table[#class="tab_product_list"]/tr') # or with tbody
items = []
for t in titles:
item = MymarketItem()
item["price"] = t.xpath('.//span[#style="color:red;"]/text()').extract()[0]
items.append(item)
return items
.extract() returns a list, even if just one argument found, take the first element of the list .extract()[0]