I'm trying to scrape a website that uses AJAX to load its pages.
My Selenium browser navigates through all the pages, but the Scrapy response stays the same, so it ends up scraping the same response over and over (once per page).
Proposed solution:
I read in some answers that by using
hxs = HtmlXPathSelector(self.driver.page_source)
you can change the page source and then scrape. But it is not working, and after adding this the browser also stopped navigating.
Code:

def parse(self, response):
    self.driver.get(response.url)
    pages = int(response.xpath('//p[@class="pageingP"]/a/text()')[-2].extract())
    for i in range(pages):
        next = self.driver.find_element_by_xpath('//a[text()="Next"]')
        print response.xpath('//div[@id="searchResultDiv"]/h3/text()').extract()[0]
        try:
            next.click()
            time.sleep(3)
            #hxs = HtmlXPathSelector(self.driver.page_source)
            for sel in response.xpath("//tr/td/a"):
                item = WarnerbrosItem()
                item['url'] = response.urljoin(sel.xpath('@href').extract()[0])
                request = scrapy.Request(item['url'], callback=self.parse_job_contents, meta={'item': item}, dont_filter=True)
                yield request
        except:
            break
    self.driver.close()
Please help.
When using selenium and scrapy together, after having selenium perform the click I've read the page back for scrapy using
resp = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
That would go where your HtmlXPathSelector selector line went. All the scrapy code from that point to the end of the routine would then need to refer to resp (page rendered after the click) rather than response (page rendered before the click).
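For reference, here is a rough, untested sketch of how the question's parse loop could look once everything after the click reads from resp instead of response (WarnerbrosItem, parse_job_contents and the selectors are the question's own names, and the items import path is a guess):

import time

import scrapy
from scrapy.http import TextResponse
# adjust to wherever WarnerbrosItem actually lives in the question's project
from warnerbros.items import WarnerbrosItem

def parse(self, response):
    self.driver.get(response.url)
    pages = int(response.xpath('//p[@class="pageingP"]/a/text()')[-2].extract())
    for i in range(pages):
        next_link = self.driver.find_element_by_xpath('//a[text()="Next"]')
        next_link.click()
        time.sleep(3)  # see the note below about explicit waits
        # rebuild a Scrapy response from what the browser is showing now
        resp = TextResponse(url=self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8')
        for sel in resp.xpath("//tr/td/a"):
            item = WarnerbrosItem()
            item['url'] = resp.urljoin(sel.xpath('@href').extract()[0])
            yield scrapy.Request(item['url'],
                                 callback=self.parse_job_contents,
                                 meta={'item': item},
                                 dont_filter=True)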
The time.sleep(3) may give you issues, as it doesn't guarantee the page has actually loaded; it's just an unconditional wait. It might be better to use something like
WebDriverWait(self.driver, 30).until(test page has changed)
which waits until the page you are waiting for passes a specific test, such as finding the expected page number or manufacturer's part number.
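For instance, a small untested sketch using Selenium's expected_conditions, waiting for the old results to go stale after the click (the XPaths are just the ones from the question; adjust the condition to whatever actually changes on your page):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

old_first_result = self.driver.find_element_by_xpath('//tr/td/a')
next_link = self.driver.find_element_by_xpath('//a[text()="Next"]')
next_link.click()
# block until the old element is detached from the DOM, i.e. the page content was replaced
WebDriverWait(self.driver, 30).until(EC.staleness_of(old_first_result))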
I'm not sure what the impact of closing the driver at the end of every pass through parse() is. I've used the following snippet in my spider to close the driver when the spider is closed.
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver

def __init__(self, filename=None):
    # wire us up to selenium
    self.driver = webdriver.Firefox()
    dispatcher.connect(self.spider_closed, signals.spider_closed)

def spider_closed(self, spider):
    self.driver.close()
Selenium isn't connected to Scrapy or its response object in any way, and in your code I don't see you changing the response object.
You'll have to work with them independently.
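If it helps, one rough way to bridge the two is to feed whatever the browser currently shows into a Scrapy Selector yourself (the XPath here is just the one from the question):

from scrapy.selector import Selector

# parse the browser's current DOM with Scrapy's selectors
sel = Selector(text=self.driver.page_source)
titles = sel.xpath('//div[@id="searchResultDiv"]/h3/text()').extract()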
Related
I am new to Scrapy and am using Scrapy with Python 2.7 for web automation. I want to click an HTML button on a website, which opens a login form. My problem is that I just want to click the button and transfer control to the new page. I have read all the similar questions but found none satisfactory, because they all involve a direct login or use Selenium.
Below is the HTML code for the button; I want to visit http://example.com/login, where the login page is.
<div class="pull-left">
    <a href="http://example.com/login">Employers</a>
</div>
I have written code to extract the link, but how do I visit that link and carry out the next step? Below is my code.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'pro'
    url = "http://login-page.com/"

    def start_requests(self):
        yield scrapy.Request(self.url, self.parse_login)

    def parse_login(self, response):
        employers = response.css("div.pull-left a::attr(href)").extract_first()
        print employers
Do I need to use yield every time with a callback to a new function just to visit a link, or is there another way to do it?
What you need is to yield a new request, or, more simply, use response.follow as in the docs:
def parse_login(self, response):
    next_page = response.css("div.pull-left a::attr(href)").extract_first()
    if next_page is not None:
        yield response.follow(next_page, callback=self.next_page_parse)
As for the callback, it basically depends on how easily the page can be parsed; for example, check the generic spiders section in the docs.
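As a rough, untested illustration of such a callback (the selector is a placeholder, not taken from the actual login page):

def next_page_parse(self, response):
    # 'response' is now the login page the link pointed to
    form_action = response.css("form::attr(action)").extract_first()
    print form_action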
A year ago, I used Django's StreamingHttpResponse to stream a text file, and Chrome immediately displayed every chunk of text it received. Now, with the same code, Chrome only displays the text once it has completely loaded the file, which risks a server timeout. This does not happen with Firefox.
I created a simple test:
# views.py
import time

from django.http import StreamingHttpResponse
from django.views import generic

class TestEditView(generic.TemplateView):

    def generator(self):
        for _ in range(15):
            time.sleep(1)
            yield 'THIS IS {}\n'.format(_)
            print('LOG: THIS IS {}\n'.format(_))

    def get(self, request, *args, **kwargs):
        return StreamingHttpResponse(self.generator(),
                                     content_type="text/plain; charset=utf-8")
If I access that view in Firefox, the browser prints out 'THIS IS ....' each second for 15 seconds. But in Chrome, the browser waits 15 seconds and then prints out all of the 'THIS IS...' lines, even though the development server logs 'LOG: THIS IS...' once a second.
I wonder if there is any subtlety in this problem that I missed. Thank you.
Python: 3.6.2.
Django: 1.10.5
Changing the content_type from "text/plain" to "text/html", or removing the content_type altogether, solves the problem: it makes Chrome render each chunk of text immediately after receiving it.
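In other words, something along these lines (an untested sketch of the question's get() method with only the content type changed):

from django.http import StreamingHttpResponse

def get(self, request, *args, **kwargs):
    # "text/html" (or no content_type at all) lets Chrome render chunks as they arrive
    return StreamingHttpResponse(self.generator(),
                                 content_type="text/html; charset=utf-8")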
I'm new to python and scrapy.
I want to scrap data from website.
The web site uses AJAX for scrolling.
The get request url is as below.
http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Mumbai&search=Chemical+Dealers&where=&catid=944&psearch=&prid=&page=2&SID=&mntypgrp=0&toknbkt=&bookDate=
Please help me figure out how I can do this with Scrapy or any other Python library.
Thanks.
Seems like this AJAX request expects a correct Referer header, which is just the URL of the current page. You can simply set the header when creating the request:
def parse(self, response):
    # e.g. http://www.justdial.com/Mumbai/Dentists/ct-385543
    my_headers = {'Referer': response.url}
    yield Request("ajax_request_url",
                  headers=my_headers,
                  callback=self.parse_ajax)

def parse_ajax(self, response):
    # results should be here
    pass
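If you need more than the first page, here is a rough, untested sketch of walking the page parameter with the same Referer trick (the URL is a trimmed version of the one in the question, and how you parse the returned markup depends on what the endpoint actually sends back):

# assumes: from scrapy.http import Request
def parse(self, response):
    ajax_url = ("http://www.justdial.com/functions/ajxsearch.php?national_search=0"
                "&act=pagination&city=Mumbai&search=Chemical+Dealers"
                "&where=&catid=944&psearch=&prid=&page={page}")
    for page in range(1, 6):  # first 5 pages, adjust as needed
        yield Request(ajax_url.format(page=page),
                      headers={'Referer': response.url},
                      callback=self.parse_ajax)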
Hello, I am trying to scrape an e-commerce website that loads data with scrolling and a "load more" button. I followed this How to scrape website with infinte scrolling? link, but when I tried the code it closed the spider without any products; maybe the structure has changed. I would like some help getting started, as I am quite new to web scraping.
Edit:
OK, I am scraping this website: http://www.jabong.com/women/, which has subcategories, and I am trying to scrape all of the subcategories' products. I tried the code above, but that didn't work, so after doing some research I created the code below, which works but doesn't satisfy my goal. So far I have tried this:
import scrapy
#from scrapy.exceptions import CloseSpider
from scrapy.spiders import Spider
#from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from koovs.items import product
#from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class propubSpider(scrapy.Spider):
    name = 'koovs'
    allowed_domains = ['jabong.com']
    max_pages = 40

    def start_requests(self):
        for i in range(self.max_pages):
            yield scrapy.Request('http://www.jabong.com/women/clothing/tops-tees-shirts/tops/?page=%d' % i, callback=self.parse)

    def parse(self, response):
        for sel in response.xpath('//*[@id="catalog-product"]/section[2]'):
            item = product()
            item['price'] = sel.xpath('//*[@class="price"]/span[2]/text()').extract()
            item['image'] = sel.xpath('//*[@class="primary-image thumb loaded"]/img').extract()
            item['title'] = sel.xpath('//*[@data-original-href]/@href').extract()
            yield item
The code above works for one category, and only if I specify the number of pages. The website has a lot of products for a given category and I don't know how many pages they span, so I decided to use a crawl spider to go through all the categories and product pages and fetch the data. But I am very new to Scrapy; any help would be highly appreciated.
The first thing you need to understand is that the DOM structure of websites often changes, so a scraper written in the past may or may not work for you now.
So the best approach when scraping a website is to find a hidden API or a hidden URL that can only be seen when you analyze the website's network traffic. This not only gives you a reliable solution for scraping but also saves bandwidth, which is very important when doing broad crawling, as most of the time you don't need to download the whole page.
Let's take the example of the website you are crawling to get more clarity. When you visit this page you can see the button that says Show More Product. Go to your browser's developer tools and select the network analyzer. When you click the button you will see the browser sending a GET request to this link. Check the response and you will see a list of all the products on the first page. Now when you analyze this link, you can see it has a parameter page=1. Change it to page=2 and you will see the list of all products on the second page.
Now go ahead and write the spider. It will be something like:
import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.spider import BaseSpider
from scrapy.http import Request
from jabong.items import product

class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/tops-tees-shirts/tops/?page=1&limit=52&sortField=popularity&sortBy=desc",
    ]
    page = 1

    def parse(self, response):
        products = response.xpath('//*[@id="catalog-product"]/section[2]/div')
        if not products:
            raise CloseSpider("No more products!")

        for p in products:
            item = product()
            item['price'] = p.xpath('a/div/div[2]/span[@class="standard-price"]/text()').extract()
            item['title'] = p.xpath('a/div/div[1]/text()').extract()
            if item['title']:
                yield item

        self.page += 1
        yield Request(url="http://www.jabong.com/women/clothing/tops-tees-shirts/tops/?page=%d&limit=52&sortField=popularity&sortBy=desc" % self.page,
                      callback=self.parse,
                      dont_filter=True)
N.B. This example is just for educational purposes. Please refer to the website's Terms and Conditions/Privacy Policy/Robots.txt before crawling/scraping/storing any data from the website.
I'd like to add a 'Last seen' URL list to a project, so that the last 5 articles requested by users can be displayed in that list to all users.
I've read the middleware docs but could not figure out how to use it in my case.
What I need is a simple working example of a middleware that captures the requests so that they can be saved and reused.
Hmm, I don't know if I would do it with middleware or write a decorator. But since your question is about middleware, here is my example:
from django.utils import timezone
from myapp.models import ViewLogger  # wherever your log model lives

class ViewLoggerMiddleware(object):
    def process_response(self, request, response):
        # We only want to save successful responses
        if response.status_code not in [200, 302]:
            return response

        ViewLogger.objects.create(user_id=request.user.id,
                                  view_url=request.get_full_path(),
                                  timestamp=timezone.now())
        return response
Showing the top 5 would be something like:
ViewLogger.objects.filter(user_id=request.user.id).order_by("-timestamp")[:5]
Note: the code is not tested, and I'm not sure if status_code is a real attribute of the response. Also, you could change the list of valid status codes.
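The snippet also assumes a ViewLogger model exists somewhere; here is a rough sketch of what it might look like (field names chosen to match the middleware above, not taken from any existing app):

from django.db import models

class ViewLogger(models.Model):
    # who requested the page; nullable so anonymous requests don't break logging
    user_id = models.IntegerField(null=True, blank=True)
    view_url = models.CharField(max_length=500)
    timestamp = models.DateTimeField()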