Python scrapy working (only half of the time) - python-2.7

I created a python scrapy project to extract the prices of some google flights.
I configured the middleware to use PhantomJS instead of a normal browser.
class JSMiddleware(object):
def process_request(self, request, spider):
driver = webdriver.PhantomJS()
try:
driver.get(request.url)
time.sleep(1.5)
except e:
raise ValueError("request url failed - \n url: {},\n error:
{}").format(request.url, e)
body = driver.page_source
#encoding='utf-8' - add to html response if necessary
return HtmlResponse(driver.current_url, body=body,encoding='utf-8',
request=request)
In the settings.py i added:
DOWNLOADER_MIDDLEWARES = {
# key path intermediate class, order value of middleware
'scraper_module.middlewares.middleware.JSMiddleware' : 543 ,
# prohibit the built-in middleware
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware' : None , } `
I also created the following spider class:
import scrapy
from scrapy import Selector
class Gspider(scrapy.Spider):
name = "google_spider"
def __init__(self):
self.start_urls = ["https://www.google.pt/flights/#search;f=LIS;t=POR;d=2017-06-18;r=2017-06-22"]
self.prices = []
self.links = []
def clean_price(self, part):
#part received as a list
#the encoding is utf-8
part = part[0]
part = part.encode('utf-8')
part = filter(str.isdigit, part)
return part
def clean_link(self, part):
part = part[0]
part = part.encode('utf-8')
return part
def get_part(self, var_holder, response, marker, inner_marker, amount = 1):
selector = Selector(response)
divs = selector.css(marker)
for n, div in enumerate(divs):
if n < amount:
part = div.css(inner_marker).extract()
if inner_marker == '::text':
part = self.clean_price(part)
else:
part = self.clean_link(part)
var_holder.append(part)
else:
break
return var_holder
def parse(self, response):
prices, links = [], []
prices = self.get_part(prices, response, 'div.OMOBOQD-d-Ab', '::text')
print prices
links = self.get_part(links, response, 'a.OMOBOQD-d-X', 'a::attr(href)')
print links
The problem is, I run the code in the shell, and around half of the times I successfully get the prices and links requested, but another half of the time, the final vectors which should contain the extracted data, are empty.
I do not get any errors during execution.
Does anyone have any idea about why this is happening?
here are the logs from the command line:

Google has a very strict policy in terms of crawling. (Pretty hypocritical when you know that they constently crawl all the web...)
You should either find an API, as said previously in the comments or maybe use proxies. An easy way is to use Crawlera. It manages thousands of proxies so you don't have to bother. I personnaly use it to crawl google and it works perfectly. The downside is that it is not free.

Related

Scrapy crawler not recursively crawling next page

I am trying to build this crawler to get housing data from craigslist,
but the crawler stops after fetching the first page and does not go to the next page .
Here is the code , it works for the first page ,but for the love of god I dont understand why it does not get to the next page .Any insight is really appreciated .I followed this part from scrapy tutorial
import scrapy
import re
from scrapy.linkextractors import LinkExtractor
class QuotesSpider(scrapy.Spider):
name = "craigslistmm"
start_urls = [
"https://vancouver.craigslist.ca/search/hhh"
]
def parse_second(self,response):
#need all the info in a dict
meta_dict = response.meta
for q in response.css("section.page-container"):
meta_dict["post_details"]= {
"location":
{"longitude":q.css("div.mapAndAttrs div.mapbox div.viewposting::attr(data-longitude)" ).extract(),
"latitude":q.css("div.mapAndAttrs div.mapbox div.viewposting::attr(data-latitude)" ).extract()},
"detailed_info": ' '.join(q.css('section#postingbody::text').extract()).strip()
}
return meta_dict
def parse(self, response):
pattern = re.compile("\/([a-z]+)\/([a-z]+)\/.+")
for q in response.css("li.result-row"):
post_urls = q.css("p.result-info a::attr(href)").extract_first()
mm = re.match(pattern, post_urls)
neighborhood= q.css("p.result-info span.result-meta span.result-hood::text").extract_first()
next_url = "https://vancouver.craigslist.ca/"+ post_urls
request = scrapy.Request(next_url,callback=self.parse_second)
#next_page = response.xpath('.//a[#class="button next"]/#href').extract_first()
#follow_url = "https://vancouver.craigslist.ca/" + next_page
#request1 = scrapy.Request(follow_url,callback=self.parse)
#yield response.follow(next_page,callback = self.parse)
request.meta['id'] = q.css("li.result-row::attr(data-pid)").extract_first()
request.meta['pricevaluation'] = q.css("p.result-info span.result-meta span.result-price::text").extract_first()
request.meta["information"] = q.css("p.result-info span.result-meta span.housing::text" ).extract_first()
request.meta["neighborhood"] =q.css("p.result-info span.result-meta span.result-hood::text").extract_first()
request.meta["area"] = mm.group(1)
request.meta["adtype"] = mm.group(2)
yield request
#yield scrapy.Request(follow_url, callback=self.parse)
next_page = LinkExtractor(allow="s=\d+").extract_links(response)[0]
# = "https://vancouver.craigslist.ca/" + next_page
yield response.follow(next_page.url,callback=self.parse)
The problem seems to be with the next_page extraction using LinkExtractor. If you look in the look, you'll see duplicate requests being filtered. There are more links on the page that satisfy your extraction rule and maybe they are not extracted in any particular order (or not in the order you wish).
I think better approach is to extract exactly the information you want, try it with this:
next_page = response.xpath('//span[#class="buttons"]//a[contains(., "next")]/#href').extract_first()

How run an spider sequentially to sites that use session in scrapy

I wanna scrape a web page that first send an AjaxFormPost that open a session and next send an _SearchResultGridPopulate to populate a control that I need to scrape, the response is a json.
this is a fragment of my code:
def parse_AjaxFormPost(self, response):
self.logger.info("parse_AjaxFormPost")
page = response.meta['page']
header = {
'Accept':'*/*',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'en-US,en;q=0.8',
'Connection':'keep-alive',
'Content-Length':'14',
'Content-Type':'application/x-www-form-urlencoded',
'Cookie':'ASP.NET_SessionId=gq4dgcsl500y32xb1n2ciexq',
.
.
.
}
url = '<url>/Search/AjaxFormPost'
cities = ['city1','city2',...]
for city in cities:
formData = {
'City':city
}
re = scrapy.FormRequest(
url,
formdata=formData,
headers=header,
dont_filter=True,
callback=self.parse_GridPopulate
)
yield re
def parse_GridPopulate(self,response):
self.logger.info("parse_LookupPermitTypeDetails")
url = '<url>/Search//_SearchResultGridPopulate?Grid-page=2&Grid-size=10&Grid-CERT_KEYSIZE=128&Grid-CERT_SECRETKEYSIZE=2048&Grid-HTTPS_KEYSIZE=128&Grid-HTTPS_SECRETKEYSIZE=2048'
header = {
'Accept':'*/*',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'en-US,en;q=0.8',
'Connection':'keep-alive',
'Content-Length':'23',
'Content-Type':'application/x-www-form-urlencoded',
'Cookie':'ASP.NET_SessionId=gq4dgcsl500y32xb1n2ciexq',
.
.
.
}
formData = {
'page':'1',
'size':'10'
}
re = scrapy.FormRequest(
url,
formdata=formData,
headers=header,
dont_filter=True,
callback=self.parse
)
yield re
def parse(self, response):
self.logger.info("parse_permit")
data_json = json.loads(response.body)
for row in data_json["data"]:
self.logger.info(row)
item = RedmondPermitItem()
item["item1"] = row["item1"]
item["item2"] = row["item2"]
yield item
The problem is that scrapy do request concurrent and when and the request in parse_AjaxFormPost open a session so when pass to the parse_LookupPermitTypeDetails I got the session of the last request do it in parse_AjaxFormPost. So if I have 10 cities at the end I got 10 times the information of the last city.
In settings I changed the configuration:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1
And it doesn't work. On other hand I thought in run the spider only for one city every time something like
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider(scrapy.Spider):
# Your first spider definition
...
...
configure_logging()
runner = CrawlerRunner()
#defer.inlineCallbacks
def crawl():
cities = ['city1','city2',...]
for city in cities:
yield runner.crawl(MySpider1,city=city)
reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished
Maybe this can be the only one solution, but I'm not sure. I would like to create a procedure for every site with this characteristic.
Any suggestion about how solve that, is possible to achive this only configuring settings.
thanks in advance.
Update1
I change the title because is important that is for sites that use session
This is a problem of understanding how concurrency works, as this isn't parallelism you can still work sequentially, but between callbacks. I would suggest something like this:
def parse_AjaxFormPost(self, response):
...
cities = ['city1','city2',...]
formData = {
'City':cities[0]
}
re = scrapy.FormRequest(
url,
formdata=formData,
headers=header,
dont_filter=True,
callback=self.parse_remaining_cities,
meta={'remaining_cities': cities[1:]}, # check the meta argument
)
yield re
def parse_remaining_cities(self, response):
remaining_cities = response.meta['remaining_cities']
current_city = remaining_cities[0]
...
yield Request(
...,
meta={'remaining_cities': remaining_cities[1:]},
callback=self.parse_remaining_cities)
This way you are doing one request at a time and in a row from city to city.

How to order NDB query by the key?

I try to use task queues on Google App Engine. I want to utilize the Mapper class shown in the App Engine documentation "Background work with the deferred library".
I get an exception on the ordering of the query result by the key
def get_query(self):
...
q = q.order("__key__")
...
Exception:
File "C:... mapper.py", line 41, in get_query
q = q.order("__key__")
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\ext\ndb\query.py", line 1124, in order
'received %r' % arg)
TypeError: order() expects a Property or query Order; received '__key__'
INFO 2017-03-09 11:56:32,448 module.py:806] default: "POST /_ah/queue/deferred HTTP/1.1" 500 114
The article is from 2009, so I guess something might have changed.
My environment: Windows 7, Python 2.7.9, Google App Engine SDK 1.9.50
There are somewhat similar questions about ordering in NDB on SO.
What bugs me this code is from the official doc, presumably updated in Feb 2017 (recently) and posted by someone within top 0.1 % of SO users by reputation.
So I must be doing something wrong. What is the solution?
Bingo.
Avinash Raj is correct. If it were an answer I'd accept it.
Here is the full class code
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from google.appengine.ext import deferred
from google.appengine.ext import ndb
from google.appengine.runtime import DeadlineExceededError
import logging
class Mapper(object):
"""
from https://cloud.google.com/appengine/docs/standard/python/ndb/queries
corrected with suggestions from Stack Overflow
http://stackoverflow.com/questions/42692319/how-to-order-ndb-query-by-the-key
"""
# Subclasses should replace this with a model class (eg, model.Person).
KIND = None
# Subclasses can replace this with a list of (property, value) tuples to filter by.
FILTERS = []
def __init__(self):
logging.info("Mapper.__init__: {}")
self.to_put = []
self.to_delete = []
def map(self, entity):
"""Updates a single entity.
Implementers should return a tuple containing two iterables (to_update, to_delete).
"""
return ([], [])
def finish(self):
"""Called when the mapper has finished, to allow for any final work to be done."""
pass
def get_query(self):
"""Returns a query over the specified kind, with any appropriate filters applied."""
q = self.KIND.query()
for prop, value in self.FILTERS:
q = q.filter(prop == value)
if __name__ == '__main__':
q = q.order(self.KIND.key) # the fixed version. The original q.order('__key__') failed
# see http://stackoverflow.com/questions/42692319/how-to-order-ndb-query-by-the-key
return q
def run(self, batch_size=100):
"""Starts the mapper running."""
logging.info("Mapper.run: batch_size: {}".format(batch_size))
self._continue(None, batch_size)
def _batch_write(self):
"""Writes updates and deletes entities in a batch."""
if self.to_put:
ndb.put_multi(self.to_put)
self.to_put = []
if self.to_delete:
ndb.delete_multi(self.to_delete)
self.to_delete = []
def _continue(self, start_key, batch_size):
q = self.get_query()
# If we're resuming, pick up where we left off last time.
if start_key:
key_prop = getattr(self.KIND, '_key')
q = q.filter(key_prop > start_key)
# Keep updating records until we run out of time.
try:
# Steps over the results, returning each entity and its index.
for i, entity in enumerate(q):
map_updates, map_deletes = self.map(entity)
self.to_put.extend(map_updates)
self.to_delete.extend(map_deletes)
# Do updates and deletes in batches.
if (i + 1) % batch_size == 0:
self._batch_write()
# Record the last entity we processed.
start_key = entity.key
self._batch_write()
except DeadlineExceededError:
# Write any unfinished updates to the datastore.
self._batch_write()
# Queue a new task to pick up where we left off.
deferred.defer(self._continue, start_key, batch_size)
return
self.finish()

How to scrape pages after login

I try to find a way to scrape and parse more pages in the signed in area.
These example links accesible from signed in I want to parse.
#http://example.com/seller/demand/?id=305554
#http://example.com/seller/demand/?id=305553
#http://example.com/seller/demand/?id=305552
#....
I want to create spider that can open each one of these links and then parse them.
I have created another spider which can open and parse only one of them.
When I tried to create "for" or "while" to call more requests with other links it allowed me not because I cannot put more returns into generator, it returns error. I also tried link extractors, but it didn't work for me.
Here is my code:
#!c:/server/www/scrapy
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from array import *
from stack.items import StackItem
from scrapy.linkextractors import LinkExtractor
class Spider3(Spider):
name = "Spider3"
allowed_domains = ["example.com"]
start_urls = ["http://example.com/login"] #this link lead to login page
When I am signed in it returns page with url, that contains "stat", that is why I put here first "if" condition.
When I am signed in, I request one link and call function parse_items.
def parse(self, response):
#when "stat" is in url it means that I just signed in
if "stat" in response.url:
return Request("http://example.com/seller/demand/?id=305554", callback = self.parse_items)
else:
#this succesful login turns me to page, it's url contains "stat"
return [FormRequest.from_response(response,
formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login', 'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},callback=self.parse)]
Function parse_items simply parse desired content from one desired page:
def parse_items(self,response):
questions = Selector(response).xpath('//*[#id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
for question in questions:
item = StackItem()
item['name'] = question.xpath('th/text()').extract()[0]
item['value'] = question.xpath('td/text()').extract()[0]
yield item
Can you help me please to update this code to open and parse more than one page in each sessions?
I don't want to sign in over and over for each request.
The session most likely depends on the cookies and scrapy manages that by itself. I.e:
def parse_items(self,response):
questions = Selector(response).xpath('//*[#id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
for question in questions:
item = StackItem()
item['name'] = question.xpath('th/text()').extract()[0]
item['value'] = question.xpath('td/text()').extract()[0]
yield item
next_url = '' # find url to next page in the current page
if next_url:
yield Request(next_url, self.parse_items)
# scrapy will retain the session for the next page if it's managed by cookies
I am currently working on the same problem. I use InitSpider so I can overwrite __init__ and init_request. The first is just for initialisation of custom stuff and the actual magic happens in my init_request:
def init_request(self):
"""This function is called before crawling starts."""
# Do not start a request on error,
# simply return nothing and quit scrapy
if self.abort:
return
# Do a login
if self.login_required:
# Start with login first
return Request(url=self.login_page, callback=self.login)
else:
# Start with pase function
return Request(url=self.base_url, callback=self.parse)
My login looks like this
def login(self, response):
"""Generate a login request."""
self.log('Login called')
return FormRequest.from_response(
response,
formdata=self.login_data,
method=self.login_method,
callback=self.check_login_response
)
self.login_data is a dict with post values.
I am still a beginner with python and scrapy, so I might be doing it the wrong way. Anyway, so far I have produced a working version that can be viewed on github.
HTH:
https://github.com/cytopia/crawlpy

Aggregate data from list of Requests in a crawler

Perhaps I am missing something simple, so I hope this is an easy question. I am using Scrapy to parse a directory listing and then pull down each appropriate web page (actually a text file) and parse it out using Python.
Each page has a set of data I am interested in, and I update a global dictionary each time I encounter such an item in each page.
What I would like to do is simply print out an aggregate summary when all Request calls are complete, however after the yield command nothing runs. I am assuming because yield is actually returning a generator and bailing out.
I'd like to avoid writing a file for each Request object if possible...I'd rather keep it self contained within this Python script.
Here is the code I am using:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
slain = {}
class GameSpider(BaseSpider):
name = "game"
allowed_domains = ["example.org"]
start_urls = [
"http://example.org/users/deaths/"
]
def parse(self, response):
links = response.xpath('//a/#href').extract()
for link in links:
if 'txt' in link:
l = self.start_urls[0] + link
yield Request(l, callback=self.parse_following, dont_filter=True)
# Ideally print out the aggregate after all Requests are satisfied
# print "-----"
# for k,v in slain.iteritems():
# print "Slain by %s: %d" % (k,v)
# print "-----"
def parse_following(self, response):
parsed_resp = response.body.rstrip().split('\n')
for line in parsed_resp:
if "Slain" in line:
broken = line.split()
slain_by = broken[3]
if (slain_by in slain):
slain[slain_by] += 1
else:
slain[slain_by] = 1
You have closed(reason) function, it is called when the spider finishes.
def closed(self, reason):
for k,v in self.slain.iteritems():
print "Slain by %s: %d" % (k,v)