Scrape data from an infinite scrolling page using Scrapy (Python 2.7)

I'm new to Python and Scrapy. I want to scrape data from a website that uses AJAX for infinite scrolling. The GET request URL is as below:
http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Mumbai&search=Chemical+Dealers&where=&catid=944&psearch=&prid=&page=2&SID=&mntypgrp=0&toknbkt=&bookDate=
Please help me understand how I can do this with Scrapy or any other Python library. Thanks.

It seems like this AJAX request expects a correct Referer header, which is just the URL of the current page. You can simply set the header when creating the request:
def parse(self, response):
    # e.g. http://www.justdial.com/Mumbai/Dentists/ct-385543
    my_headers = {'Referer': response.url}
    yield Request("ajax_request_url",
                  headers=my_headers,
                  callback=self.parse_ajax)

def parse_ajax(self, response):
    # results should be here
    pass
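Putting the pieces together, a minimal sketch of a spider that walks the AJAX pagination (only the ajxsearch.php URL comes from the question; the listing start URL and the .store-name selector are assumptions):

import scrapy

class ChemicalDealersSpider(scrapy.Spider):
    name = "chemical_dealers"
    # assumed listing page; it serves as the Referer for the AJAX calls
    start_urls = ["http://www.justdial.com/Mumbai/Chemical-Dealers"]

    def parse(self, response):
        ajax_url = ("http://www.justdial.com/functions/ajxsearch.php?national_search=0"
                    "&act=pagination&city=Mumbai&search=Chemical+Dealers&catid=944&page=%d")
        for page in range(1, 6):  # first five "scrolls" as an example
            yield scrapy.Request(ajax_url % page,
                                 headers={"Referer": response.url},
                                 callback=self.parse_ajax)

    def parse_ajax(self, response):
        # hypothetical selector; inspect the AJAX response to find the real one
        for name in response.css(".store-name::text").extract():
            yield {"name": name}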


Django URL error on a GET request with params

I have a Django API to export a file, which needs a format as input.
Request URL: http://192.168.5.51:1212/rest/tasks/export_file/CIAYEK5W5JS4MdmCF2t8eB?format=xyz
This request returns an error response ("detail", with status 404), but when I send the GET request without query params,
Request URL: http://192.168.5.51:1212/rest/tasks/export_file/CIAYEK5W5JS4MdmCF2t8eB
the API is triggered and the file is returned (with the default format). As far as I know, we don't need to change anything in urlpatterns to support query params. I have also put the specified URL on the first line to eliminate the chance of any other regex catching the request:
urlpatterns = [
    url(r'^/export_file/(?P<pk>.+)$', views.TaskFileTranscript.as_view()),
]
How do I support query params in Django requests? Thank you in advance.
P.S.: I'm pretty sure control is not reaching the get function. I'm using DRF.
Thanks everyone, it is fixed now. It turns out it's a really bad idea to use 'format' as a query param key, because DRF reserves it for content negotiation.
You have to handle the parameter yourself, for example in a class-based view's get method:
from rest_framework.views import APIView
from rest_framework.response import Response

class Test(APIView):
    def get(self, request):
        format_query = request.GET.get('format', None)
        # use this value to customize your result
        return Response(result)
and also add a / to the end of your URL:
http://192.168.5.51:1212/rest/tasks/export_file/CIAYEK5W5JS4MdmCF2t8eB/?format=xyz
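For completeness, a minimal sketch of the fixed view with a renamed parameter (the file_format name and the response body are illustrative, not from the original answer):

from rest_framework.views import APIView
from rest_framework.response import Response

class TaskFileTranscript(APIView):
    def get(self, request, pk):
        # a non-reserved name, so it never collides with DRF's format handling
        file_format = request.GET.get('file_format', 'txt')
        return Response({'id': pk, 'format': file_format})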

Python Django URL shows old image when exchanging files

When running the Django server and hitting the URL http://127.0.0.1:8000/media/pictures/h2.jpg, I was getting the requested image (jpg).
Now I exchange the jpg for another file which is also called h2.jpg, but when I call the same URL again it still shows the old picture.
How do I handle that? It needs to happen automatically in the backend, without user action.
Django version: 2.1.7
You can use this middleware (on Django 2.1, a plain process_response class needs MiddlewareMixin to work with the new-style MIDDLEWARE setting):

from django.utils.cache import add_never_cache_headers
from django.utils.deprecation import MiddlewareMixin

class NoCachingMiddleware(MiddlewareMixin):
    def process_response(self, request, response):
        add_never_cache_headers(response)
        return response
from this question:
https://stackoverflow.com/a/13489175/11027652
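To activate it, the class also has to be registered in settings.py (the module path is hypothetical and depends on where you put the file):

MIDDLEWARE = [
    # ... Django's default middleware ...
    'myapp.middleware.NoCachingMiddleware',
]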
So, now the new file has a timestamp included in its filename. This way I can first read all the files available in the folder and then take the first one to build a new dynamic file path.
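A minimal sketch of that idea, assuming the images are saved through a Django FileField (the upload_to callable and directory name are illustrative):

import os
import time
from django.db import models

def timestamped_upload(instance, filename):
    # e.g. pictures/1554822000_h2.jpg -- a fresh name on every upload,
    # so the browser can never serve a stale cached copy of the old URL
    return os.path.join('pictures', '%d_%s' % (int(time.time()), filename))

class Picture(models.Model):
    image = models.FileField(upload_to=timestamped_upload)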

Python Scrapy: click on an HTML button

I am new to Scrapy and am using it with Python 2.7 for web automation. I want to click an HTML button on a website which opens a login form. My problem is that I just want to click the button and transfer control to the new page. I have read all the similar questions, but none were satisfactory because they all involve logging in directly or using Selenium.
Below is the HTML code for the button; I want to visit http://example.com/login, where the login page is.

<div class="pull-left">
    <a href="http://example.com/login">Employers</a>
</div>

I have written code to extract the link, but how do I visit it and carry out the next step? Below is my code.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'pro'
    url = "http://login-page.com/"

    def start_requests(self):
        yield scrapy.Request(self.url, self.parse_login)

    def parse_login(self, response):
        employers = response.css("div.pull-left a::attr(href)").extract_first()
        print(employers)
Do I need to use "yield" every time with a callback to a new function just to visit a link, or is there another way to do it?
What you need is to yield a new request, or more simply use response.follow as in the docs:
def parse_login(self, response):
    next_page = response.css("div.pull-left a::attr(href)").extract_first()
    if next_page is not None:
        yield response.follow(next_page, callback=self.next_page_parse)
As for the callback, it basically depends on how easily the page can be parsed; for example, check the generic spiders section in the docs.
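A minimal sketch of what the follow-up callback might look like (the selector and yielded field are illustrative, not from the original question):

def next_page_parse(self, response):
    # we have now "clicked through" to http://example.com/login
    form_action = response.css('form::attr(action)').extract_first()
    yield {'login_form_action': form_action}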

Refresh Scrapy response after a Selenium browser click

I'm trying to scrape a website that uses AJAX to load its different pages.
Although my Selenium browser is navigating through all the pages, the Scrapy response stays the same, so it ends up scraping the same response (number-of-pages times).
Proposed solution:
I read in some answers that by using
hxs = HtmlXPathSelector(self.driver.page_source)
you can change the page source and then scrape it. But it is not working, and after adding this the browser also stopped navigating.
Code:

def parse(self, response):
    self.driver.get(response.url)
    pages = int(response.xpath('//p[@class="pageingP"]/a/text()')[-2].extract())
    for i in range(pages):
        next = self.driver.find_element_by_xpath('//a[text()="Next"]')
        print response.xpath('//div[@id="searchResultDiv"]/h3/text()').extract()[0]
        try:
            next.click()
            time.sleep(3)
            # hxs = HtmlXPathSelector(self.driver.page_source)
            for sel in response.xpath("//tr/td/a"):
                item = WarnerbrosItem()
                item['url'] = response.urljoin(sel.xpath('@href').extract()[0])
                request = scrapy.Request(item['url'], callback=self.parse_job_contents,
                                         meta={'item': item}, dont_filter=True)
                yield request
        except:
            break
    self.driver.close()
Please Help.
When using Selenium and Scrapy together, after having Selenium perform the click I've read the page back for Scrapy using
resp = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
That would go where your HtmlXPathSelector line went. All the Scrapy code from that point to the end of the routine would then need to refer to resp (the page rendered after the click) rather than response (the page rendered before the click).
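A minimal sketch of how that fits into the question's loop, assuming the same XPaths (TextResponse lives in scrapy.http):

from scrapy.http import TextResponse

# inside parse(), right after next.click() and the wait:
resp = TextResponse(url=self.driver.current_url,
                    body=self.driver.page_source,
                    encoding='utf-8')
for sel in resp.xpath("//tr/td/a"):  # parse the freshly rendered page
    item = WarnerbrosItem()
    item['url'] = resp.urljoin(sel.xpath('@href').extract()[0])
    yield scrapy.Request(item['url'], callback=self.parse_job_contents,
                         meta={'item': item}, dont_filter=True)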
The time.sleep(3) may give you issues, as it doesn't guarantee the page has actually loaded; it's just an unconditional wait. It might be better to use something like
WebDriverWait(self.driver, 30).until(<test that the page has changed>)
which waits until the page passes a specific test, such as finding the expected page number or manufacturer's part number.
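For example, a concrete version might look like this (the condition is one plausible choice; the XPath is the question's own "Next" link):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block for up to 30 seconds until the "Next" link is clickable again,
# i.e. the AJAX load has finished re-rendering the pagination
WebDriverWait(self.driver, 30).until(
    EC.element_to_be_clickable((By.XPATH, '//a[text()="Next"]'))
)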
I'm not sure what the impact of closing the driver at the end of every pass through parse() is. I've used the following snippet in my spider to close the driver when the spider is closed.

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver

def __init__(self, filename=None):
    # wire us up to selenium
    self.driver = webdriver.Firefox()
    dispatcher.connect(self.spider_closed, signals.spider_closed)

def spider_closed(self, spider):
    self.driver.close()
Selenium isn't connected to Scrapy or its response object in any way, and in your code I don't see you changing the response object. You'll have to work with them independently.

How to save the latest URL requests in Django?

I'd like to add a 'Last seen' URL list to a project, so that the last 5 articles requested by users can be displayed in a list to all users.
I've read the middleware docs but could not figure out how to use them in my case.
What I need is a simple working example of a middleware that captures the requests so they can be saved and reused.
Hmm, I don't know if I would do it with middleware or rather write a decorator. But as your question is about middleware, here is my example:
from django.utils import timezone

class ViewLoggerMiddleware(object):
    def process_response(self, request, response):
        # We only want to save successful responses
        if response.status_code not in [200, 302]:
            return response
        ViewLogger.objects.create(user_id=request.user.id,
                                  view_url=request.get_full_path(),
                                  timestamp=timezone.now())
        return response
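The ViewLogger model isn't shown in the answer; a minimal version consistent with the fields used above might be (hypothetical):

from django.db import models

class ViewLogger(models.Model):
    # fields match the create() call in the middleware above
    user_id = models.IntegerField(null=True)
    view_url = models.CharField(max_length=255)
    timestamp = models.DateTimeField()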
Showing the top 5 would be something like:
ViewLogger.objects.filter(user_id=request.user.id).order_by("-timestamp")[:5]
Note: the code is not tested. status_code is in fact a real attribute of Django's HttpResponse, so the check above works; also, you could adjust the list of valid status codes.