Aggregate data from list of Requests in a crawler - python-2.7

Perhaps I am missing something simple, so I hope this is an easy question. I am using Scrapy to parse a directory listing and then pull down each appropriate web page (actually a text file) and parse it out using Python.
Each page has a set of data I am interested in, and I update a global dictionary each time I encounter such an item in each page.
What I would like to do is simply print out an aggregate summary when all Request calls are complete, however after the yield command nothing runs. I am assuming because yield is actually returning a generator and bailing out.
I'd like to avoid writing a file for each Request object if possible...I'd rather keep it self contained within this Python script.
Here is the code I am using:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
slain = {}
class GameSpider(BaseSpider):
name = "game"
allowed_domains = ["example.org"]
start_urls = [
"http://example.org/users/deaths/"
]
def parse(self, response):
links = response.xpath('//a/#href').extract()
for link in links:
if 'txt' in link:
l = self.start_urls[0] + link
yield Request(l, callback=self.parse_following, dont_filter=True)
# Ideally print out the aggregate after all Requests are satisfied
# print "-----"
# for k,v in slain.iteritems():
# print "Slain by %s: %d" % (k,v)
# print "-----"
def parse_following(self, response):
parsed_resp = response.body.rstrip().split('\n')
for line in parsed_resp:
if "Slain" in line:
broken = line.split()
slain_by = broken[3]
if (slain_by in slain):
slain[slain_by] += 1
else:
slain[slain_by] = 1

You have closed(reason) function, it is called when the spider finishes.
def closed(self, reason):
for k,v in self.slain.iteritems():
print "Slain by %s: %d" % (k,v)

Related

Python scrapy working (only half of the time)

I created a python scrapy project to extract the prices of some google flights.
I configured the middleware to use PhantomJS instead of a normal browser.
class JSMiddleware(object):
def process_request(self, request, spider):
driver = webdriver.PhantomJS()
try:
driver.get(request.url)
time.sleep(1.5)
except e:
raise ValueError("request url failed - \n url: {},\n error:
{}").format(request.url, e)
body = driver.page_source
#encoding='utf-8' - add to html response if necessary
return HtmlResponse(driver.current_url, body=body,encoding='utf-8',
request=request)
In the settings.py i added:
DOWNLOADER_MIDDLEWARES = {
# key path intermediate class, order value of middleware
'scraper_module.middlewares.middleware.JSMiddleware' : 543 ,
# prohibit the built-in middleware
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware' : None , } `
I also created the following spider class:
import scrapy
from scrapy import Selector
class Gspider(scrapy.Spider):
name = "google_spider"
def __init__(self):
self.start_urls = ["https://www.google.pt/flights/#search;f=LIS;t=POR;d=2017-06-18;r=2017-06-22"]
self.prices = []
self.links = []
def clean_price(self, part):
#part received as a list
#the encoding is utf-8
part = part[0]
part = part.encode('utf-8')
part = filter(str.isdigit, part)
return part
def clean_link(self, part):
part = part[0]
part = part.encode('utf-8')
return part
def get_part(self, var_holder, response, marker, inner_marker, amount = 1):
selector = Selector(response)
divs = selector.css(marker)
for n, div in enumerate(divs):
if n < amount:
part = div.css(inner_marker).extract()
if inner_marker == '::text':
part = self.clean_price(part)
else:
part = self.clean_link(part)
var_holder.append(part)
else:
break
return var_holder
def parse(self, response):
prices, links = [], []
prices = self.get_part(prices, response, 'div.OMOBOQD-d-Ab', '::text')
print prices
links = self.get_part(links, response, 'a.OMOBOQD-d-X', 'a::attr(href)')
print links
The problem is, I run the code in the shell, and around half of the times I successfully get the prices and links requested, but another half of the time, the final vectors which should contain the extracted data, are empty.
I do not get any errors during execution.
Does anyone have any idea about why this is happening?
here are the logs from the command line:
Google has a very strict policy in terms of crawling. (Pretty hypocritical when you know that they constently crawl all the web...)
You should either find an API, as said previously in the comments or maybe use proxies. An easy way is to use Crawlera. It manages thousands of proxies so you don't have to bother. I personnaly use it to crawl google and it works perfectly. The downside is that it is not free.

unable to scrape a text file through a given link and store it in an output file

I'm trying to collect the weather data for year 2000 from this site.
My code for the spider is:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class weather(CrawlSpider):
name = 'data'
start_urls = [
'https://www1.ncdc.noaa.gov/pub/data/uscrn/products/daily01/'
]
custom_settings = {
'DEPTH_LIMIT': '2',
}
rules = (
Rule(LinkExtractor(restrict_xpaths=
('//table/tr[2]/td[1]/a',)),callback='parse_item',follow=True),
)
def parse_item(self, response):
for b in response.xpath('//table')
yield scrapy.request('/tr[4]/td[2]/a/#href').extract()
yield scrapy.request('/tr[5]/td[2]/a/#href').extract()
The paths with 'yield' are the links to two text file and i want to scrape the data from these text files and store it separately in two different files but I don't know how to continue.
I don't usually use CrawlSpider so I'm unfamiliar with it, but it seems like you should be able to create another xpath (preferably something more specific than "/tr[4]/td[2]/a/#href") and supply a callback function.
However, in a "typical" scrapy project using Spider instead of CrawlSpider, you would simply yield a Request with another callback function to handle extracting and storing to the database. For example:
def parse_item(self, response):
for b in response.xpath('//table')
url = b.xpath('/tr[4]/td[2]/a/#href').extract_first()
yield scrapy.Request(url=url, callback=extract_and_store)
url = b.xpath('/tr[5]/td[2]/a/#href').extract_first()
yield scrapy.Request(url=url, callback=extract_and_store)
def extract_and_store(self, response):
"""
Scrape data and store it separately
"""

How to order NDB query by the key?

I try to use task queues on Google App Engine. I want to utilize the Mapper class shown in the App Engine documentation "Background work with the deferred library".
I get an exception on the ordering of the query result by the key
def get_query(self):
...
q = q.order("__key__")
...
Exception:
File "C:... mapper.py", line 41, in get_query
q = q.order("__key__")
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\ext\ndb\query.py", line 1124, in order
'received %r' % arg)
TypeError: order() expects a Property or query Order; received '__key__'
INFO 2017-03-09 11:56:32,448 module.py:806] default: "POST /_ah/queue/deferred HTTP/1.1" 500 114
The article is from 2009, so I guess something might have changed.
My environment: Windows 7, Python 2.7.9, Google App Engine SDK 1.9.50
There are somewhat similar questions about ordering in NDB on SO.
What bugs me this code is from the official doc, presumably updated in Feb 2017 (recently) and posted by someone within top 0.1 % of SO users by reputation.
So I must be doing something wrong. What is the solution?
Bingo.
Avinash Raj is correct. If it were an answer I'd accept it.
Here is the full class code
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from google.appengine.ext import deferred
from google.appengine.ext import ndb
from google.appengine.runtime import DeadlineExceededError
import logging
class Mapper(object):
"""
from https://cloud.google.com/appengine/docs/standard/python/ndb/queries
corrected with suggestions from Stack Overflow
http://stackoverflow.com/questions/42692319/how-to-order-ndb-query-by-the-key
"""
# Subclasses should replace this with a model class (eg, model.Person).
KIND = None
# Subclasses can replace this with a list of (property, value) tuples to filter by.
FILTERS = []
def __init__(self):
logging.info("Mapper.__init__: {}")
self.to_put = []
self.to_delete = []
def map(self, entity):
"""Updates a single entity.
Implementers should return a tuple containing two iterables (to_update, to_delete).
"""
return ([], [])
def finish(self):
"""Called when the mapper has finished, to allow for any final work to be done."""
pass
def get_query(self):
"""Returns a query over the specified kind, with any appropriate filters applied."""
q = self.KIND.query()
for prop, value in self.FILTERS:
q = q.filter(prop == value)
if __name__ == '__main__':
q = q.order(self.KIND.key) # the fixed version. The original q.order('__key__') failed
# see http://stackoverflow.com/questions/42692319/how-to-order-ndb-query-by-the-key
return q
def run(self, batch_size=100):
"""Starts the mapper running."""
logging.info("Mapper.run: batch_size: {}".format(batch_size))
self._continue(None, batch_size)
def _batch_write(self):
"""Writes updates and deletes entities in a batch."""
if self.to_put:
ndb.put_multi(self.to_put)
self.to_put = []
if self.to_delete:
ndb.delete_multi(self.to_delete)
self.to_delete = []
def _continue(self, start_key, batch_size):
q = self.get_query()
# If we're resuming, pick up where we left off last time.
if start_key:
key_prop = getattr(self.KIND, '_key')
q = q.filter(key_prop > start_key)
# Keep updating records until we run out of time.
try:
# Steps over the results, returning each entity and its index.
for i, entity in enumerate(q):
map_updates, map_deletes = self.map(entity)
self.to_put.extend(map_updates)
self.to_delete.extend(map_deletes)
# Do updates and deletes in batches.
if (i + 1) % batch_size == 0:
self._batch_write()
# Record the last entity we processed.
start_key = entity.key
self._batch_write()
except DeadlineExceededError:
# Write any unfinished updates to the datastore.
self._batch_write()
# Queue a new task to pick up where we left off.
deferred.defer(self._continue, start_key, batch_size)
return
self.finish()

How to scrape pages after login

I try to find a way to scrape and parse more pages in the signed in area.
These example links accesible from signed in I want to parse.
#http://example.com/seller/demand/?id=305554
#http://example.com/seller/demand/?id=305553
#http://example.com/seller/demand/?id=305552
#....
I want to create spider that can open each one of these links and then parse them.
I have created another spider which can open and parse only one of them.
When I tried to create "for" or "while" to call more requests with other links it allowed me not because I cannot put more returns into generator, it returns error. I also tried link extractors, but it didn't work for me.
Here is my code:
#!c:/server/www/scrapy
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from array import *
from stack.items import StackItem
from scrapy.linkextractors import LinkExtractor
class Spider3(Spider):
name = "Spider3"
allowed_domains = ["example.com"]
start_urls = ["http://example.com/login"] #this link lead to login page
When I am signed in it returns page with url, that contains "stat", that is why I put here first "if" condition.
When I am signed in, I request one link and call function parse_items.
def parse(self, response):
#when "stat" is in url it means that I just signed in
if "stat" in response.url:
return Request("http://example.com/seller/demand/?id=305554", callback = self.parse_items)
else:
#this succesful login turns me to page, it's url contains "stat"
return [FormRequest.from_response(response,
formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login', 'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},callback=self.parse)]
Function parse_items simply parse desired content from one desired page:
def parse_items(self,response):
questions = Selector(response).xpath('//*[#id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
for question in questions:
item = StackItem()
item['name'] = question.xpath('th/text()').extract()[0]
item['value'] = question.xpath('td/text()').extract()[0]
yield item
Can you help me please to update this code to open and parse more than one page in each sessions?
I don't want to sign in over and over for each request.
The session most likely depends on the cookies and scrapy manages that by itself. I.e:
def parse_items(self,response):
questions = Selector(response).xpath('//*[#id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
for question in questions:
item = StackItem()
item['name'] = question.xpath('th/text()').extract()[0]
item['value'] = question.xpath('td/text()').extract()[0]
yield item
next_url = '' # find url to next page in the current page
if next_url:
yield Request(next_url, self.parse_items)
# scrapy will retain the session for the next page if it's managed by cookies
I am currently working on the same problem. I use InitSpider so I can overwrite __init__ and init_request. The first is just for initialisation of custom stuff and the actual magic happens in my init_request:
def init_request(self):
"""This function is called before crawling starts."""
# Do not start a request on error,
# simply return nothing and quit scrapy
if self.abort:
return
# Do a login
if self.login_required:
# Start with login first
return Request(url=self.login_page, callback=self.login)
else:
# Start with pase function
return Request(url=self.base_url, callback=self.parse)
My login looks like this
def login(self, response):
"""Generate a login request."""
self.log('Login called')
return FormRequest.from_response(
response,
formdata=self.login_data,
method=self.login_method,
callback=self.check_login_response
)
self.login_data is a dict with post values.
I am still a beginner with python and scrapy, so I might be doing it the wrong way. Anyway, so far I have produced a working version that can be viewed on github.
HTH:
https://github.com/cytopia/crawlpy

xpath not getting selected

I have just started using Scrapy:
Here is an example of a website that I want to crawl :
http://www.thefreedictionary.com/shame
The code for my Spider :
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from dic_crawler.items import DicCrawlerItem
from urlBuilder import *
class Dic_crawler(BaseSpider):
name = "dic"
allowed_domains = ["www.thefreedictionary.com"]
start_urls = listmaker()[:]
print start_urls
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//*[#id="MainTxt"]/table/tbody')
print 'SITES:\n',sites
item = DicCrawlerItem()
item["meanings"] = sites.select('//*[#id="MainTxt"]/table/tbody/tr/td/div[1]/div[1]/div[1]/text()').extract()
print item
return item
The listmaker() returns a list of urls to scrap.
My problem is that the sites variable comes up empty if I select till 'tbody' in the xpath and returns an empty sites variable, Whereas if I select only table I get the part of the site I want.
I am not able to retrieve the meaning for a word as a result of this into item["meanings"] since the part after tbody is does not select beyond tbody.
Also while at it, the site gives multiple meanings which I would like to extract but I only know how to extract a single method.
Thanks
Here's a spider skeleton to get you started:
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
class Dic_crawler(BaseSpider):
name = "thefreedictionary"
allowed_domains = ["www.thefreedictionary.com"]
start_urls = ['http://www.thefreedictionary.com/shame']
def parse(self, response):
hxs = HtmlXPathSelector(response)
# loop on each "noun" or "verb" or something... section
for category in hxs.select('id("MainTxt")//div[#class="pseg"]'):
# this is simply to get what's in the <i> tag
category_name = u''.join(category.select('./i/text()').extract())
self.log("category: %s" % category_name)
# for each category, a term can have multiple definition
# category from .select() is a selector
# so you can call .select() on it also,
# here with a relative XPath expression selecting all definitions
for definition in category.select('div[#class="ds-list"]'):
definition_text = u'\n'.join(
definition.select('.//text()').extract())
self.log(" - definition: %s" % definition_text)