Python Scrapy Click on html button - python-2.7

I am new to Scrapy and am using it with Python 2.7 for web automation. I want to click an HTML button on a website that opens a login form. My problem is that I just want to click the button and transfer control to the new page. I have read all the similar questions, but none of them were satisfactory because they all either log in directly or use Selenium.
Below is the HTML code for the button. I want to visit http://example.com/login, where the login page is.
<div class="pull-left">
    <a href="http://example.com/login">Employers</a>
</div>
I have written code to extract the link, but how do I visit that link and carry out the next step? Below is my code.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'pro'
    url = "http://login-page.com/"

    def start_requests(self):
        yield scrapy.Request(self.url, self.parse_login)

    def parse_login(self, response):
        employers = response.css("div.pull-left a::attr(href)").extract_first()
        print employers
Do I need to use "yield" every time with a callback to a new function just to visit a link, or is there another way to do it?

What you need is to yield a new request, or, more simply, use response.follow as shown in the docs:
def parse_login(self, response):
    next_page = response.css("div.pull-left a::attr(href)").extract_first()
    if next_page is not None:
        yield response.follow(next_page, callback=self.next_page_parse)
As for the callback, it basically depends on how easily the page can be parsed; for example, check the generic spiders section in the docs.
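For illustration, here is a minimal sketch of what that next_page_parse callback might do once the login page loads (the selector and log message are assumptions, since the question doesn't show the login page's markup):

def next_page_parse(self, response):
    # hypothetical: grab where the login form posts to
    form_action = response.css("form::attr(action)").extract_first()
    self.logger.info("Login form posts to %s", form_action)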

Related

Django - Prevent a User from Accessing a Page when He/She Typed the URL

I have a page in Django that I don't want anyone to access except when they click the specific link that I made for that page.
I'm aware of @login_required, but the problem is that I want the page to be restricted to EVERYONE.
I haven't tried any code yet since I have absolutely no idea how to do it; even Google did not give me an answer. Please help.
I had the same problem a few months back, and the way I solved it was by making a POST request.
Whenever a user clicks the link on the page, I make a POST request to the Django application with a verification token in the request body.
You can use any simple token mechanism, check the token's validity in the Django view, and on success allow the user to access the page.
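A minimal sketch of that token check in a Django view (the view name, template and token format are assumptions; they are not from the original answer):

# Sketch of the token approach described above; names are illustrative only.
from django.core import signing
from django.http import HttpResponseForbidden
from django.shortcuts import render

# When rendering the page that contains the link/button, generate something like
#   token = signing.dumps({'target': 'secret-page'})
# and submit it in a hidden form field when the link is clicked.

def secret_page(request):
    if request.method != 'POST':
        return HttpResponseForbidden()
    try:
        # rejects missing, tampered or older-than-5-minutes tokens
        signing.loads(request.POST.get('token', ''), max_age=300)
    except signing.BadSignature:
        return HttpResponseForbidden()
    return render(request, 'secret.html')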
The most common way to achieve this is to use randomized links. In pseudocode:
# Page 1
<a href="/router/some-random-string">

# view that serves '/secret-page'
class SecretView(View):
    def _get(self, request):
        # render the real page here
        ...

    def get(self, request):
        # direct access to the URL returns a 404
        return HttpResponseNotFound()

# view that serves '/router/<str:hash>'
class AccessorView(SecretView):
    def get(self, request, hash):
        # look up and validate the hash;
        # if valid, render the secret page
        return super()._get(request)
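To make the routing concrete, here is a hedged urls.py sketch for those two views, using Django 2+ path() syntax (the module and route names are assumptions):

# urls.py sketch; names are assumptions
from django.urls import path
from .views import SecretView, AccessorView

urlpatterns = [
    path('secret-page', SecretView.as_view()),
    path('router/<str:hash>', AccessorView.as_view(), name='router'),
]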

Scrape data from Infinite scrolling using scrapy

I'm new to Python and Scrapy.
I want to scrape data from a website.
The website uses AJAX for scrolling.
The GET request URL is as below.
http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Mumbai&search=Chemical+Dealers&where=&catid=944&psearch=&prid=&page=2&SID=&mntypgrp=0&toknbkt=&bookDate=
Please help me with how I can do this using Scrapy or any other Python library.
Thanks.
It seems this AJAX request expects a correct Referer header, which is just the URL of the current page. You can simply set the header when creating the request:
def parse(self, response):
    # e.g. http://www.justdial.com/Mumbai/Dentists/ct-385543
    my_headers = {'Referer': response.url}
    yield Request("ajax_request_url",
                  headers=my_headers,
                  callback=self.parse_ajax)

def parse_ajax(self, response):
    # results should be here
    pass
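Building on that, a rough sketch of how the "infinite scroll" could be driven by incrementing the page parameter of that ajxsearch.php URL (the five-page cap and the parse_ajax body are assumptions):

from scrapy import Request

def parse(self, response):
    # sketch: request the AJAX endpoint page by page, with the listing page as Referer
    ajax_url = ("http://www.justdial.com/functions/ajxsearch.php?"
                "national_search=0&act=pagination&city=Mumbai"
                "&search=Chemical+Dealers&catid=944&page=%d")
    for page in range(1, 6):  # assumption: first five "scrolls"
        yield Request(ajax_url % page,
                      headers={'Referer': response.url},
                      callback=self.parse_ajax)

def parse_ajax(self, response):
    # each response is an HTML fragment containing one batch of results
    pass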

Django - How to stay on the same page without refreshing page?

I am using a Bootstrap modal fade window that renders a Django form to update my database records. What I fail to do is avoid reloading the page if the user has opened the update window and did not change anything. It will be easier to get my idea if you look at the code below:
def updateTask(request, task_id):
    #cur_usr_sale_point = PersonUnique.objects.filter(employees__employeeuser__auth_user = request.user.id).values_list('agreementemployees__agreement_unique__sale_point_id',flat=True)
    selected_task = Tasks.objects.get(id=task_id)
    task_form = TaskForm(instance=selected_task)
    taskTable = Tasks.objects.all()

    if request.method == 'POST':
        task_form = TaskForm(request.POST, instance=selected_task)
        if task_form.has_changed():
            if task_form.is_valid():
                # inside your model instance add each field with the wanted value for it
                task_form.save()
                return HttpResponseRedirect('/task_list/')
        else:
            # The user did not change any data, but I still tell Django to
            # reload my page, thus wasting my time.
            return HttpResponseRedirect('/task_list/')

    return render_to_response('task_management/task_list.html',
                              {'createTask_form': task_form, 'task_id': task_id, 'taskTable': taskTable},
                              context_instance=RequestContext(request))
The question is: is there any way to tell Django to change the URL (as happens after a redirect) but not load the same page with the same data a second time?
It's not trivial, but the basic steps you need are:
Write some javascript to usurp the form submit button click
Call your ajax function which sends data to "checking" view
Write a "checking" view that will check if form data has changed
If data have changed, submit the form
If not, just stay on page
This blog post is a nice walkthrough of the entire process (though targeted towards a different end result, you'll need to modify the view).
And here are some SO answers that will help with the steps above:
Basically:

$('#your-form-id').on('submit', function(event){
    event.preventDefault();
    your_ajax_function();
});

(see "Call ajax function on form submit")
The "checking" view you've gotta write yourself!
For the last step, see "Submit form after checking".
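On the Django side, the "checking" view from the steps above could look roughly like this (a sketch; Tasks and TaskForm come from the question, while the view name and JSON contract are assumptions):

from django.http import JsonResponse

def check_task(request, task_id):
    # hypothetical endpoint the AJAX call posts to before any real submit
    selected_task = Tasks.objects.get(id=task_id)
    task_form = TaskForm(request.POST, instance=selected_task)
    if task_form.has_changed() and task_form.is_valid():
        task_form.save()
        # the javascript can follow this redirect itself
        return JsonResponse({'changed': True, 'redirect': '/task_list/'})
    # nothing changed: the javascript just closes the modal and stays on the page
    return JsonResponse({'changed': False})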

Refresh scrapy response after selenium browser Click

I'm trying to scrape a website that uses AJAX to load its different pages.
Although my Selenium browser is navigating through all the pages, the Scrapy response is still the same, and it ends up scraping the same response (number-of-pages times).
Proposed solution:
I read in some answers that by using
hxs = HtmlXPathSelector(self.driver.page_source)
you can change the page source and then scrape it. But it is not working; also, after adding this, the browser stopped navigating.
Code:
def parse(self, response):
    self.driver.get(response.url)
    pages = int(response.xpath('//p[@class="pageingP"]/a/text()')[-2].extract())
    for i in range(pages):
        next = self.driver.find_element_by_xpath('//a[text()="Next"]')
        print response.xpath('//div[@id="searchResultDiv"]/h3/text()').extract()[0]
        try:
            next.click()
            time.sleep(3)
            #hxs = HtmlXPathSelector(self.driver.page_source)
            for sel in response.xpath("//tr/td/a"):
                item = WarnerbrosItem()
                item['url'] = response.urljoin(sel.xpath('@href').extract()[0])
                request = scrapy.Request(item['url'], callback=self.parse_job_contents,
                                         meta={'item': item}, dont_filter=True)
                yield request
        except:
            break
    self.driver.close()
Please help.
When using Selenium and Scrapy together, after having Selenium perform the click I've read the page back for Scrapy using
resp = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
That would go where your HtmlXPathSelector line went. All the Scrapy code from that point to the end of the routine then needs to refer to resp (the page rendered after the click) rather than response (the page rendered before the click).
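Applied to the loop in the question, that could look roughly like the sketch below (an assumption-laden rewrite: the table markup is taken from the question, and a while/try structure replaces the original page-count loop):

from scrapy.http import TextResponse

def parse(self, response):
    self.driver.get(response.url)
    while True:
        # re-read whatever DOM selenium is currently showing
        resp = TextResponse(url=self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8')
        for sel in resp.xpath("//tr/td/a"):
            item = WarnerbrosItem()
            item['url'] = resp.urljoin(sel.xpath('@href').extract()[0])
            yield scrapy.Request(item['url'], callback=self.parse_job_contents,
                                 meta={'item': item}, dont_filter=True)
        try:
            self.driver.find_element_by_xpath('//a[text()="Next"]').click()
            time.sleep(3)
        except Exception:
            break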
The time.sleep(3) may give you issues, as it doesn't guarantee the page has actually loaded; it's just an unconditional wait. It might be better to use something like
WebDriverWait(self.driver, 30).until(<test that the page has changed>)
which waits until the page you are waiting for passes a specific test, such as finding the expected page number or manufacturer's part number.
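One way to phrase such a test, assuming the click replaces the DOM, is Selenium's staleness_of condition:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

next_link = self.driver.find_element_by_xpath('//a[text()="Next"]')
next_link.click()
# wait up to 30s for the old element to go stale, i.e. the page was replaced
WebDriverWait(self.driver, 30).until(EC.staleness_of(next_link))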
I'm not sure what the impact of closing the driver at the end of every pass through parse() is. I've used the following snippet in my spider to close the driver when the spider is closed.
# requires: from selenium import webdriver
#           from scrapy import signals
#           from scrapy.xlib.pydispatch import dispatcher
def __init__(self, filename=None):
    # wire us up to selenium
    self.driver = webdriver.Firefox()
    dispatcher.connect(self.spider_closed, signals.spider_closed)

def spider_closed(self, spider):
    self.driver.close()
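For newer Scrapy versions, where the old dispatcher import is deprecated, the same wiring can be done through from_crawler (a sketch; MySpider is a placeholder class name):

from scrapy import signals

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    # connect spider_closed to the spider_closed signal via the crawler
    spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
    return spider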
Selenium isn't in any way connected to Scrapy or its response object, and in your code I don't see you changing the response object.
You'll have to work with them independently.

Django- Redirect to same view without reloading

I have a view (views.loaditems) which runs some algorithm and passes items to a template (product.html) where the items are loaded, and with each item I have an "add_to_shortlist" link. On clicking this link, the item is added to the user's shortlist (for which I have a function). I want the page not to reload on click and to keep its items, but simply add that item to the user's shortlist. Also, where should I define this shortlist function?
I'm new to Django, and any help would be much appreciated. Thanks.
Update: Here's my code:
views.py
def loaditems(request):
    # some code
    ourdeals = SDeals.objects.filter(**{agestring3: 0})
    sorteddeals = ourdeals.order_by('-total_score')
    user = request.user
    context = {'deals': sorteddeals, 'sl_products': sl_products, 'user': user}
    template = 'index.html'
    return render_to_response(template, context, context_instance=RequestContext(request))

def usersl(request, id, id2):
    userslt = User_Shortlist.objects.filter(id__iexact=id)
    products = SDeals.objects.filter(id__iexact=id2)
    product = products[0]
    if userslt:
        userslt[0].sdeals.add(product)
        sl = userslt[0].sdeals.all()
    return render_to_response('slnew.html', {'sl': sl}, context_instance=RequestContext(request))
In my index.html I have:
<div class="slist"></div>
which in urls.py takes me to views.usersl:
url(r'^usersl/(?P<id>\d+)/(?P<id2>\d+)/$', views.usersl),
I don't want to go to slnew.html; instead I want to stay on index.html without reloading it, and on clicking 'slist', just run the function that adds the item to the shortlist.
In order to make changes on the server and in a page without navigating with the browser, you need to look at JavaScript solutions. Read up about AJAX. In essence, you need some JavaScript to send the update to the server and to change the HTML.
jQuery is one popular library that will help you do this; a more sophisticated option is AngularJS. On the Django side you'll write some views that handle these small update tasks used in the page. Libraries like Django REST framework or Django Slumber will help you with that.
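As a rough Django-side sketch, the existing usersl view could return JSON instead of rendering slnew.html, so jQuery on index.html can call it without navigating away (the response shape is an assumption; the model names come from the question):

from django.http import JsonResponse

def usersl(request, id, id2):
    # answer an AJAX call instead of rendering a new template
    userslt = User_Shortlist.objects.filter(id__iexact=id)
    products = SDeals.objects.filter(id__iexact=id2)
    if userslt and products:
        userslt[0].sdeals.add(products[0])
        return JsonResponse({'added': True,
                             'count': userslt[0].sdeals.count()})
    return JsonResponse({'added': False})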