My objective is to use pyquery with Scrapy, but from scrapy.selector import PyQuerySelector raises ImportError: cannot import name PyQuerySelector when I crawl the spider.
I followed this specific gist https://gist.github.com/joehillen/795180 to implement pyquery.
Any suggestions or tutorials that can help me get this job done?
You declare a spider class, define your rules, and in the callback attribute of the Rule's link extractor pass parse_item (with no callback, Scrapy falls back to the default parse() function):
def parse_item(self, response):
    pyquery_obj = PyQuery(response.body)
    header = self.get_header(pyquery_obj)
    return {
        'header': header,
    }

def get_header(self, pyquery_obj):
    return pyquery_obj('#page_head').text()
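To tie that together, here is a minimal CrawlSpider skeleton showing where the callback plugs in; the spider name, start URL, and link-extractor settings below are placeholders, not from the original post:

from pyquery import PyQuery
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HeaderSpider(CrawlSpider):
    name = 'header_spider'                # placeholder name
    start_urls = ['http://example.com/']  # placeholder start URL

    rules = (
        # every extracted link is fetched and handed to parse_item
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    # parse_item and get_header from the snippet above go here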
Given a uri like /home/ I want to find the view function that this corresponds to, preferably in a form like app.views.home or just <app_label>.<view_func>. Is there a function that will give me this?
You can use the resolve function provided by Django to get the view function, and the __module__ attribute of the returned function to get the app label. That attribute is a string like project.app.views. So something like this:
from django.urls import resolve
myfunc, myargs, mykwargs = resolve("/hello_world/")
mymodule = myfunc.__module__
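For example, if mymodule comes back as something like project.app.views (the layout mentioned above), the app label is the second-to-last dotted component:

app_label = mymodule.split('.')[-2]  # 'project.app.views' -> 'app'
view_name = myfunc.__name__          # e.g. 'home'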
If a class-based view is being used and you need the class of the view, you can access the view_class attribute of the returned function:
view_class = myfunc.view_class
From Django 2.0 onward the django.core.urlresolvers module has been moved to django.urls, so you will need to do this:
from django.urls import resolve
myfunc, myargs, mykwargs = resolve("/hello_world/")
mymodule = myfunc.__module__
Since Django 1.3 (March 2011) the resolve function in the urlresolvers module returns a ResolverMatch object, which provides access to all attributes of the resolved URL match, including the path of the view callable.
>>> from django.core.urlresolvers import resolve
>>> match = resolve('/')
>>> match.func
<function apps.core.views.HomeView>
>>> match._func_path
'apps.core.views.HomeView'
1. Generate a text file with all URLs and their corresponding view functions, using the show_urls command from the django-extensions package:
./manage.py show_urls --format pretty-json --settings=<path-to-settings> > urls.txt
For example:
./manage.py show_urls --format pretty-json --settings=settings2.testing > urls.txt
2. Look for your URL in the output file urls.txt
{
"url": "/v3/affiliate/commission/",
"module": "api.views.affiliate.CommissionView",
"name": "affiliate-commission",
},
All the other answers focus on just the module or the string representation of the view. However, if you want to directly access the view object for some reason, this could be handy:
resolve('the_path/').func.cls
This gives you the view itself. It works for class-based views; I haven't tested it on function-based views, though.
Based on KillianDS's answer, here's my solution:
import os

from django.core.urlresolvers import resolve
from django.shortcuts import render_to_response
from django.template import RequestContext

def response(request, template=None, vars={}):
    if template is None:
        view_func = resolve(request.META['REQUEST_URI'])[0]
        # second-to-last component, e.g. 'blog' from 'project.blog.views'
        app_label = view_func.__module__.split('.')[-2]
        view_name = view_func.__name__
        template = '%s.html' % os.path.join(app_label, view_name)
    return render_to_response(template, vars, context_instance=RequestContext(request))
Now you can just call return response(request) at the end of your view functions, and it will automatically load app/view.html as the template and pass in the request context.
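As a usage sketch, a view in a hypothetical "blog" app would then reduce to this:

# views.py of a hypothetical "blog" app
def home(request):
    # No template argument given, so the helper resolves the current URL
    # and renders blog/home.html with the request context.
    return response(request)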
I'm trying to collect the weather data for the year 2000 from this site.
My code for the spider is:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class weather(CrawlSpider):
    name = 'data'
    start_urls = [
        'https://www1.ncdc.noaa.gov/pub/data/uscrn/products/daily01/'
    ]
    custom_settings = {
        'DEPTH_LIMIT': '2',
    }
    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//table/tr[2]/td[1]/a',)),
             callback='parse_item', follow=True),
    )
    def parse_item(self, response):
        for b in response.xpath('//table'):
            yield scrapy.request('/tr[4]/td[2]/a/@href').extract()
            yield scrapy.request('/tr[5]/td[2]/a/@href').extract()
The paths in the yield statements are the links to two text files, and I want to scrape the data from these text files and store it separately in two different files, but I don't know how to continue.
I don't usually use CrawlSpider, so I'm unfamiliar with it, but it seems like you should be able to create another XPath (preferably something more specific than "/tr[4]/td[2]/a/@href") and supply a callback function.
However, in a "typical" scrapy project using Spider instead of CrawlSpider, you would simply yield a Request with another callback function to handle extracting and storing to the database. For example:
def parse_item(self, response):
    for b in response.xpath('//table'):
        url = b.xpath('./tr[4]/td[2]/a/@href').extract_first()
        # if the href is relative, wrap it with response.urljoin(url)
        yield scrapy.Request(url=url, callback=self.extract_and_store)
        url = b.xpath('./tr[5]/td[2]/a/@href').extract_first()
        yield scrapy.Request(url=url, callback=self.extract_and_store)

def extract_and_store(self, response):
    """
    Scrape data and store it separately
    """
I am trying to find a way to scrape and parse more pages in the signed-in area.
These example links, accessible after signing in, are what I want to parse.
#http://example.com/seller/demand/?id=305554
#http://example.com/seller/demand/?id=305553
#http://example.com/seller/demand/?id=305552
#....
I want to create a spider that can open each one of these links and then parse them.
I have created another spider which can open and parse only one of them.
When I tried to use a "for" or "while" loop to issue more requests with the other links, it did not allow me to, because I cannot put more return statements into the function; it raised an error. I also tried link extractors, but they didn't work for me.
Here is my code:
#!c:/server/www/scrapy
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from array import *
from stack.items import StackItem
from scrapy.linkextractors import LinkExtractor
class Spider3(Spider):
    name = "Spider3"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/login"]  # this link leads to the login page
When I am signed in, the site returns a page whose URL contains "stat"; that is why I put the first "if" condition there.
Once signed in, I request one link and call the function parse_items.
def parse(self, response):
    # when "stat" is in the url it means that I just signed in
    if "stat" in response.url:
        return Request("http://example.com/seller/demand/?id=305554", callback=self.parse_items)
    else:
        # a successful login takes me to a page whose url contains "stat"
        return [FormRequest.from_response(response,
            formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login',
                      'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
            callback=self.parse)]
The function parse_items simply parses the desired content from one page:
def parse_items(self, response):
    questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
    for question in questions:
        item = StackItem()
        item['name'] = question.xpath('th/text()').extract()[0]
        item['value'] = question.xpath('td/text()').extract()[0]
        yield item
Can you please help me update this code so it opens and parses more than one page in each session?
I don't want to sign in over and over for each request.
The session most likely depends on cookies, and Scrapy manages those by itself, i.e.:
def parse_items(self, response):
    questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
    for question in questions:
        item = StackItem()
        item['name'] = question.xpath('th/text()').extract()[0]
        item['value'] = question.xpath('td/text()').extract()[0]
        yield item
    next_url = ''  # find the url to the next page in the current page
    if next_url:
        yield Request(next_url, self.parse_items)
        # scrapy will retain the session for the next page if it's managed by cookies
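Since the pages you listed only differ in their id parameter, you can also yield one request per id from the same signed-in session (yield works fine inside a loop, unlike return). The id list below is only an assumption based on the URLs in your question:

def parse(self, response):
    if "stat" in response.url:
        # already signed in: request every target page within the same session
        for demand_id in (305554, 305553, 305552):  # hypothetical list of ids
            yield Request(
                "http://example.com/seller/demand/?id=%d" % demand_id,
                callback=self.parse_items,
            )
    else:
        yield FormRequest.from_response(
            response,
            formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login',
                      'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
            callback=self.parse,
        )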
I am currently working on the same problem. I use InitSpider so I can override __init__ and init_request. The first is just for initialisation of custom stuff, and the actual magic happens in my init_request:
def init_request(self):
    """This function is called before crawling starts."""
    # Do not start a request on error,
    # simply return nothing and quit scrapy
    if self.abort:
        return
    # Do a login
    if self.login_required:
        # Start with the login first
        return Request(url=self.login_page, callback=self.login)
    else:
        # Start with the parse function
        return Request(url=self.base_url, callback=self.parse)
My login looks like this:
def login(self, response):
    """Generate a login request."""
    self.log('Login called')
    return FormRequest.from_response(
        response,
        formdata=self.login_data,
        method=self.login_method,
        callback=self.check_login_response
    )
self.login_data is a dict with the POST values.
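The check_login_response callback is not shown above; a rough sketch of what it could look like follows, where the "logout" marker string is only an assumption about the target site:

def check_login_response(self, response):
    """Verify that the login worked, then start the actual crawl."""
    if b'logout' in response.body.lower():  # assumed marker of a signed-in page
        self.log('Login successful')
        # hand control back to InitSpider, reusing the session cookies
        return self.initialized()
    self.log('Login failed')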
I am still a beginner with python and scrapy, so I might be doing it the wrong way. Anyway, so far I have produced a working version that can be viewed on github.
HTH:
https://github.com/cytopia/crawlpy
I have to crawl a website, so I use Scrapy to do it, but I need to pass a cookie to bypass the first page (which is a kind of login page where you choose your location).
I heard on the web that you need to do this with a base Spider (not a CrawlSpider), but I need to use a CrawlSpider to do my crawling, so what do I need to do?
Start with a base Spider first and then launch my CrawlSpider? But I don't know whether the cookie will be passed between them, or how to do it. How do you launch a spider from another spider?
How do I handle the cookie? I tried this:
def start_requests(self):
    yield Request(url='http://www.auchandrive.fr/drive/St-Quentin-985/', cookies={'auchanCook': '"985|"'})
But it is not working.
The answer should be here, but the guy is really evasive and I don't know what to do.
First, you need to enable cookies in the settings.py file:
COOKIES_ENABLED = True
Here is my test spider code for your reference. I tested it and it passed:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy import log
class Stackoverflow23370004Spider(CrawlSpider):
    name = 'auchandrive.fr'
    allowed_domains = ["auchandrive.fr"]
    target_url = "http://www.auchandrive.fr/drive/St-Quentin-985/"

    def start_requests(self):
        yield Request(self.target_url, cookies={'auchanCook': "985|"}, callback=self.parse_page)

    def parse_page(self, response):
        if 'St-Quentin-985' in response.url:
            self.log("Passed : %r" % response.url, log.DEBUG)
        else:
            self.log("Failed : %r" % response.url, log.DEBUG)
You can run the command to test it and watch the console output:
scrapy crawl auchandrive.fr
I noticed that in your code snippet, you were using cookies={'auchanCook': '"985|"'}, instead of cookies={'auchanCook': "985|"}.
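In other words, the outer Python quotes end up in the cookie value itself:

cookies={'auchanCook': '"985|"'}  # sends the value "985|" including the literal double quotes
cookies={'auchanCook': "985|"}    # sends the value 985| without them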
This should get you started:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
class AuchanDriveSpider(CrawlSpider):
    name = 'auchandrive'
    allowed_domains = ["auchandrive.fr"]

    # pseudo-start_url
    begin_url = "http://www.auchandrive.fr/"
    # start URL used as shop selection
    select_shop_url = "http://www.auchandrive.fr/drive/St-Quentin-985/"

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//ul[@class="header-menu"]',))),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "vignette-content")]',)),
             callback='parse_product'),
    )

    def start_requests(self):
        yield Request(self.begin_url, callback=self.select_shop)

    def select_shop(self, response):
        return Request(url=self.select_shop_url, cookies={'auchanCook': "985|"})

    def parse_product(self, response):
        self.log("parse_product: %r" % response.url)
Pagination might be tricky.
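If the pagination turns out to be plain links, one more rule along these lines might cover it; the XPath below is purely hypothetical, since I don't know the site's markup:

# purely hypothetical pagination rule, to be added to the rules tuple above
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "pagination")]',)),
     follow=True),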
I'm working on learning how to use Scrapy, especially Scrapy with cookie handling. The problem is that I cannot find many examples, tutorials, or documentation that can help me in this endeavor. If anyone could supply any material I would be really grateful. To show you how lost I am, the code below should show my lack of understanding:
from scrapy.spider import BaseSpider
from scrapy.http.cookies import CookieJar
class sasSpider(BaseSpider):
    name = "sas"
    allowed_domains = ["sas.no"]
    start_urls = []

    def parse(self, response):
        Request("http://www.sas.no", meta={'cookiejar': response.meta['cookiejar']}, callback=self.nextfunction)

    def nextfunction(self, response):
        cookieJar = response.meta.setdefault('cookiejar', CookieJar())
        cookieJar.extract_cookies(response, response.request)
        for cookie in CookieJar:
            open('cookies.html', 'wb').write(cookie)
If you want to manually add cookies, just pass them in:
yield Request("http://www.sas.no", cookies={
'foo': 'bar'
}, callback=self.nextfunction)
They'll be kept for all future requests. Just remember to do this in start_requests if you want them to be there for all of the requests.
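For example, a minimal start_requests along those lines (the cookie name and value are just illustrative):

def start_requests(self):
    # set the cookie on the very first request; Scrapy's cookie middleware
    # keeps it for every request that follows in the same session
    yield Request("http://www.sas.no", cookies={'foo': 'bar'}, callback=self.parse)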
If you only want to attach predefined cookies, maybe overriding make_requests_from_url is a better idea:
import scrapy

class MySpider(scrapy.Spider):
    def make_requests_from_url(self, url):
        return scrapy.Request(url=url,
                              cookies={'currency': 'USD', 'country': 'UY'})