Scrapy FormRequest authentication with user, password plus form_key - python-2.7

I need to crawl a set of pages as an authenticated user and extract some values from them, so Scrapy has to log in before it starts parsing.
From what I understand, that means using InitSpider and init_request.
My problem is that in order to authenticate with a FormRequest I must also send the form a value that acts as a kind of session id (form_key) and is generated automatically.
So Scrapy must first access self.login_page, read the value of form_key (with the XPath shown below), and then POST the login form to self.login_post.
Both URLs are shown in the script below.
The HTML containing my form key looks like this:
<input name="form_key" type="hidden" value="2pos7YUekQz6Y9rD">
So in order to obtain my form_key value, I did this (updated scrapy script):
import logging
import urllib

from scrapy import Request, FormRequest, Selector
from scrapy.spiders.init import InitSpider  # scrapy.contrib.spiders.init in older Scrapy versions

class MyScript(InitSpider):
    name = "logged-in"
    allowed_domains = ["example.com"]
    login_post = 'https://www.example.com/customer/account/loginPost/'

    def init_request(self):
        """ Called before crawler starts """
        return Request(url=self.login_post, callback=self.login, dont_filter=True)

    def login(self, response):
        """ Generate login request """
        login_page = 'https://www.example.com/customer/account/login/'
        myformkey = self.pax_key(login_page)
        return FormRequest.from_response(response,
                                         formxpath="//div[@class='account-login']/form[@id='login-form']",
                                         formdata={'login[username]': 'user',
                                                   'login[password]': 'pass',
                                                   'form_key': myformkey},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        """ Check the response returned by login request to see if we are logged in """
        logging.log(logging.INFO, '... checking login response ...')
        if "Invalid login or password." in response.body:
            logging.log(logging.INFO, '... BAD LOGIN ...')
        else:
            logging.log(logging.INFO, 'GOOD LOGIN... initialize:' + response.url)
            self.initialized()

    def pax_key(self, url):
        # Fetch the login page outside Scrapy and pull the hidden form_key value
        data = urllib.urlopen(url).read()
        hxs = Selector(text=data)
        lista = hxs.xpath('//input[@type="hidden"][@name="form_key"]/@value').extract()
        logging.log(logging.INFO, lista)
        return lista

    def parse(self, response):
        ...
So the problem now is...:
INFO: do login...
2015-12-04 21:57:18 [root] INFO: [u'yFVS9fR8Jteuo6Co']
2015-12-04 21:57:19 [root] INFO: http://www.example.com/enable-cookies
2015-12-04 21:57:19 [scrapy] ERROR: Spider error processing <GET http://www.example.com/enable-cookies> (referer: None)
Traceback (most recent call last):
File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "d:\Calin\WorkSpace\example-spider\products\spiders\logged-in.py", line 49, in login
callback=self.check_login_response)
File "c:\python27\lib\site-packages\scrapy\http\request\form.py", line 37, in from_response
form = _get_form(response, formname, formnumber, formxpath)
File "c:\python27\lib\site-packages\scrapy\http\request\form.py", line 81, in _get_form
raise ValueError('No <form> element found with %s' % formxpath)
ValueError: No <form> element found with //div[#class='account-login']/form[#id='login-form']
So first of all... I have no idea why I get redirected towards the enable-cookies page...
Spider error processing <GET http://www.example.com/enable-cookies>
I do not tell it to go there... at all.
I assume that is why there is no form there to xpath it... so I get the error in the end.
ValueError: No <form> element found with
//div[@class='account-login']/form[@id='login-form']
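For what it's worth, the usual way to handle this kind of login in Scrapy (a sketch, not a confirmed solution from this thread): start from the login page itself, so the response already carries both the hidden form_key and the session cookie it belongs to, and let FormRequest.from_response copy the hidden fields automatically. Fetching the key separately with urllib runs in a different session, and requesting the loginPost URL directly before any cookie has been set is the sort of thing that commonly triggers an enable-cookies redirect. The URLs, field names and form XPath below are taken from the question; everything else is an assumption.

import scrapy

class LoginSketchSpider(scrapy.Spider):
    name = "logged-in-sketch"
    allowed_domains = ["example.com"]
    # Start on the login page so Scrapy owns the session cookie the form_key belongs to.
    start_urls = ['https://www.example.com/customer/account/login/']

    def parse(self, response):
        # from_response() copies every field of the matched form, including the
        # hidden form_key input, so it does not have to be fetched separately.
        return scrapy.FormRequest.from_response(
            response,
            formxpath="//div[@class='account-login']/form[@id='login-form']",
            formdata={'login[username]': 'user',
                      'login[password]': 'pass'},
            callback=self.after_login)

    def after_login(self, response):
        if "Invalid login or password." in response.body:
            self.logger.info('... BAD LOGIN ...')
            return
        self.logger.info('GOOD LOGIN... %s', response.url)
        # Crawl the authenticated pages from here.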

Related

Keep getting this error when trying to render my custom HTML form

Here is the error I got when I removed the else clause from my views.py file.
File "C:\Users\david\AppData\Local\Programs\Python\Python39\lib\site-packages\django\core\handlers\base.py", line 309, in check_response
raise ValueError(
ValueError: The view register.views.register didn't return an HttpResponse object. It returned None instead.
I'm guessing I got this error because there is no else clause. I need one, but when I add it to my views.py file I get a different error.
Here is my view file:
from django.shortcuts import render, redirect
from django.contrib.auth.models import User, auth

def register(request):
    if request.method == 'POST':
        first_name = request.POST['first_name'],
        last_name = request.POST['last_name'],
        dob = request.POST['dob'],
        email = request.POST['email']
        newuser = User.objects.create_user(first_name=first_name, last_name=last_name, email=email)
        newuser.save()
        print(request.POST)
        return redirect('/home')
I originally had an else clause at the bottom of this but that caused a different error.
Here is that clause:
    else:
        return render('/register/userinfo.html')
And the error I get with it in place:
File "C:\Users\david\OneDrive\Desktop\Django\Sub\register\views.py", line 4, in render
return render(request,'/register/userinfo.html')
TypeError: render() takes 1 positional argument but 2 were given
Any idea as to what may cause this bottom error?
Edit 1: I just added the request argument to the return render call and got another error, this time:
File "C:\Users\david\AppData\Local\Programs\Python\Python39\lib\site-packages\django\template\loader.py", line 19, in get_template
raise TemplateDoesNotExist(template_name, chain=chain)
django.template.exceptions.TemplateDoesNotExist: register/userinfo.html
Looks like it can't find the template I used for the form. Its path is set up so that it is in a templates folder within my register app.
Edit 2: Got the form to show up, but when I submit it I get this Page not found error:
Page not found (404)
Request Method: POST
Request URL: http://127.0.0.1:8000/register/register
Using the URLconf defined in Main.urls, Django tried these URL patterns, in this order:
admin/
home/
about/
addinfo/
register/ [name='register']
The current path, register/register, didn’t match any of these.
You have to check your path to userinfo.html. You should not start it with /. Try this:
return render(request,'userinfo.html')
The second error message suggests that your code on line 4 was this:
return render(request,'/register/userinfo.html')
You say your else clause was:
    else:
        return render('/register/userinfo.html')
Are you sure you saved and restarted your project before testing?
Otherwise I have no idea how this error could occur.
I suggest checking out the code of the render function to see what may cause the problem.
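Putting the answer together with the view from the question, here is a minimal sketch of how the whole view could look. The template name, field names and redirect target are taken from the thread; username=email is my assumption, since create_user() for the default User model needs a username. Note also that the trailing commas in the original assignments turn first_name, last_name and dob into one-element tuples, which is probably not intended:

from django.shortcuts import render, redirect
from django.contrib.auth.models import User

def register(request):
    if request.method == 'POST':
        # No trailing commas here, so these are plain strings rather than tuples.
        first_name = request.POST['first_name']
        last_name = request.POST['last_name']
        dob = request.POST['dob']
        email = request.POST['email']
        # Assumption: use the e-mail as the username, which create_user() requires.
        newuser = User.objects.create_user(username=email, email=email,
                                           first_name=first_name, last_name=last_name)
        newuser.save()
        return redirect('/home')
    # GET (and anything else) must also return an HttpResponse, otherwise
    # Django raises the "returned None" ValueError from the question.
    return render(request, 'userinfo.html')

As for the 404 on register/register in Edit 2: that pattern usually means the form's action attribute is a relative path such as action="register", so posting from /register/ resolves to /register/register, which matches none of the URL patterns. Pointing the action at the named pattern (action="{% url 'register' %}") or leaving it empty so the form posts back to the current URL is the usual fix; this is an inference from the error page, not something confirmed in the thread.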

Scrapy post request form data

I want to get the search result using a Scrapy POST request after entering 16308 into the CP Number field on https://www.icsi.in/Student/Default.aspx?TabID=100.
Here is my Scrapy spider code:
def parse(self, response):
    head = response.xpath('//span[@id="dnn_ctlHeader_dnnBreadcrumb_lblBreadCrumb"]/span[@class="SkinObject"]/text()').extract_first()
    view_gen = response.xpath('//input[@id="__VIEWSTATEGENERATOR"]/@value').extract_first()
    dnn = response.xpath('//input[@id="__dnnVariable"]/@value').extract_first()
    view_state = response.xpath('//input[@id="__VIEWSTATE"]/@value').extract_first()
    view_val = response.xpath('//input[@id="__EVENTVALIDATION"]/@value').extract_first()
    data = {
        '__VIEWSTATEGENERATOR': view_gen,
        '__dnnVariable': dnn,
        '__VIEWSTATE': view_state,
        '__EVENTVALIDATION': view_val,
        'dnn$ctr410$MemberSearch$txtCpNumber': '16803',
        'dnn$ctr410$MemberSearch$ddlMemberType': '0'
    }
    yield scrapy.FormRequest(response.url, formdata=data, callback=self.fun)
Response:
DEBUG: Crawled (200) <GET https://www.icsi.in/Student/Default.aspx?tabid=100&error=An%20unexpected%20error%20has%20occurred&content=0> (referer: https://www.icsi.in/Student/Default.aspx?TabID=100)
[]
Your question is how to avoid getting this error, right? Try to be more specific in the future.
When you want to scrape a webpage you have to inspect it in your browser, see all the parameters that are sent with the request, and make sure your spider sends the same ones. Your code includes a lot of the parameters, but not all of them.
See my code below, which actually solves your problem:
import scrapy
from scrapy.shell import inspect_response

class MySpider(scrapy.Spider):
    name = 'icsi'
    start_urls = ['https://www.icsi.in/Student/Default.aspx?TabID=100']
    search_action_url = 'https://www.icsi.in/Student/Default.aspx?TabID=100'

    def parse(self, response):
        formdata = dict()
        # Copy every input of the ASP.NET form so the postback looks like a real submit.
        for input in response.css('form#Form input'):
            name = input.xpath('./@name').get()
            value = input.xpath('./@value').get()
            formdata[name] = str(value) if value else ''
        formdata['dnn$ctr410$MemberSearch$txtCpNumber'] = '16308'
        formdata['__EVENTTARGET'] = 'dnn$ctr410$MemberSearch$btnSearch'
        return scrapy.FormRequest(self.search_action_url, formdata=formdata, callback=self.parse_search)

    def parse_search(self, response):
        inspect_response(response, self)
        return
You were missing the parameter __EVENTTARGET, which informs the site you hit the button "Search".
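Once the postback works, inspect_response can be replaced by real extraction. A sketch of what parse_search might do instead, replacing the parse_search method in the spider above; the table selector is a hypothetical placeholder, since the markup of the ICSI results page is not shown in the thread:

    def parse_search(self, response):
        # 'table.Grid tr' is a guessed selector -- check the real results page in
        # scrapy shell or the browser inspector and adjust it.
        for row in response.css('table.Grid tr')[1:]:
            cells = row.css('td::text').getall()
            if cells:
                yield {'member_row': cells}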

Scrapy scrapes from only some pages -- Crawled(200) (referer: None) Error?

I have written a scrapy project to scrape some data from the Congress.gov website. Originally, I was hoping to scrape the data on all bills. My code ran and downloaded the data I wanted, but only for about half of the bills. So I began troubleshooting: I turned on AutoThrottle in the settings and included middleware code for too many requests. I then limited the search criteria to just one congress (the 97th) and just bills originating in the Senate, and re-ran the code. It downloaded most of the bills, but again some were missing. I then tried to scrape just the pages that were missing; in particular, I tried page 32, which I was able to scrape successfully. So why won't it scrape all the pages when I use the recursive code?
Can anyone help me to figure out what the problem is? Here is the code I used to scrape the info from all bills in the 97th congress:
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from senatescraper.items import senatescraperSampleItem
from scrapy.http.request import Request

class SenatebotSpider(BaseSpider):
    name = 'recursivesenatetablebot2'
    allowed_domains = ['www.congress.gov']

    def start_requests(self):
        baseurl = "https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22chamber%22%3A%22Senate%22%2C%22congress%22%3A%5B%2297%22%5D%2C%22type%22%3A%5B%22bills%22%5D%7D&page="
        for i in xrange(1, 32):
            beginurl = baseurl + `i`
            yield Request(beginurl, self.parse_bills)

    def parse_bills(self, response):
        sel = Selector(response)
        bills = sel.xpath("//span[5][@class='result-item']")
        for bill in bills:
            bill_url = bill.css("span.result-item a::attr(href)").extract()[0]
            yield Request(url=bill_url, callback=self.parse_items)

    def parse_items(self, response):
        sel = Selector(response)
        rows = sel.css('table.item_table tbody tr')
        items = []
        for row in rows:
            item = senatescraperSampleItem()
            item['bill'] = response.css('h1.legDetail::text').extract()
            item['dates'] = row.xpath('./td[1]/text()').extract()[0]
            item['actions'] = row.css('td.actions::text').extract()
            item['congress'] = response.css('h2.primary::text').extract()
            items.append(item)
        return items
This is the code I used to just scrape page 32 of search with filters for the 97th congress, bills originating in the senate only:
import scrapy
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http.request import Request
from startingover.items import startingoverSampleItem

class DebuggingSpider(BaseSpider):
    name = 'debugging'
    allowed_domains = ['www.congress.gov']

    def start_requests(self):
        yield scrapy.Request('https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22chamber%22%3A%22Senate%22%2C%22congress%22%3A%5B%2297%22%5D%2C%22type%22%3A%5B%22bills%22%5D%7D&page=32', self.parse_page)

    def parse_page(self, response):
        sel = Selector(response)
        bills = sel.xpath("//span[5][@class='result-item']")
        for bill in bills:
            bill_url = bill.css("span.result-item a::attr(href)").extract()[0]
            yield Request(url=bill_url, callback=self.parse_items)

    def parse_items(self, response):
        sel = Selector(response)
        rows = sel.css('table.item_table tbody tr')
        items = []
        for row in rows:
            item = startingoverSampleItem()
            item['bill'] = response.css('h1.legDetail::text').extract()
            item['dates'] = row.xpath('./td[1]/text()').extract()[0]
            item['actions'] = row.css('td.actions::text').extract()
            item['congress'] = response.css('h2.primary::text').extract()
            items.append(item)
        return items
And my item:
from scrapy.item import Item, Field

class senatescraperSampleItem(Item):
    bill = Field()
    actions = Field(serializer=str)
    congress = Field(serializer=str)
    dates = Field()
I think you don't see half of the things you are trying to scrape because you are not resolving relative URLs. Using response.urljoin should remedy the situation:
yield Request(url=response.urljoin(bill_url), callback=self.parse_items)
You may experience this exception:
2018-01-30 17:27:13 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22chamber%22%3A%22Senate%22%2C%22congress%22%3A%5B%2297%22%5D%2C%22type%22%3A%5B%22bills%22%5D%7D&page=5> (referer: None)
Traceback (most recent call last):
File "/home/jorge/.virtualenvs/scrapy/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/home/jorge/.virtualenvs/scrapy/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
for x in result:
File "/home/jorge/.virtualenvs/scrapy/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/jorge/.virtualenvs/scrapy/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/jorge/.virtualenvs/scrapy/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/tmp/stackoverflow/senatescraper/senatescraper/spiders/senatespider.py", line 25, in parse_bills
bill_url = bill.css("span.result-item a::attr(href)").extract()[0]
IndexError: list index out of range
To make sure you are getting the URL from the element with the text "All actions", and don't catch anything weird that may exist before that element, you should combine your XPath query as follows:
def parse_bills(self, response):
    sel = Selector(response)
    bills = sel.xpath(
        '//a[contains(@href, "all-actions")]/@href').extract()
    for bill in bills:
        yield Request(
            url=response.urljoin(bill),
            callback=self.parse_items,
            dont_filter=True)
Note the dont_filter=True argument: I added it because Scrapy was filtering out links I had already crawled (this is the default configuration). You can remove it if you manage the filtering of duplicate links in a different manner.
When you get exceptions, you can always wrap the code in try/except and start Scrapy's debugging shell in the except block; it will help you inspect the response and see what's going on.
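To make that last suggestion concrete, here is a small sketch of what it describes: the extraction wrapped in try/except, with Scrapy's interactive shell opened when something unexpected happens. It assumes from scrapy.shell import inspect_response at the top of the spider file; the rest mirrors the parse_bills shown above.

    def parse_bills(self, response):
        try:
            bills = response.xpath('//a[contains(@href, "all-actions")]/@href').extract()
            for bill in bills:
                yield Request(url=response.urljoin(bill),
                              callback=self.parse_items,
                              dont_filter=True)
        except Exception:
            # Drops into an interactive shell with `response` bound, so the page
            # that caused the exception can be inspected on the spot.
            inspect_response(response, self)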
I made the following change to my code and it worked perfectly:
def parse_bills(self, response):
    bills = Selector(response)
    billlinks = bills.xpath('//a[contains(@href,"/all-actions")]/@href')
    for link in billlinks:
        urllink = link.extract()
        yield Request(url=urllink, callback=self.parse_items)

Include django logged user in django Traceback error

What is the easiest way to include the username, first and last name, and e-mail in the Django traceback error report?
I know that the way is to create a custom error report:
Create a new class that inherits from django.views.debug.SafeExceptionReporterFilter
Set DEFAULT_EXCEPTION_REPORTER_FILTER
But which method should I override so that the traceback also carries this information?
I would like my traceback to look like this:
Traceback (most recent call last):
File "/usr...o/core/handlers/base.py", line 89, in get_response
response = middleware_method(request)
File "/.../g...ap/utils/middleware.py", line 23,...
if elapsedTime.min > 15:
TypeError: can't compare datetime.timedelta to int
Logged user information:
User: pepito
name: Pepito Grillo
e-mail: grillo@peppeto.com
I did it using Custom Middleware. I'm not sure this is the best answer, but it is how I solved it for my project.
settings.py:
MIDDLEWARE_CLASSES = (
    ...
    'utilities.custom_middleware.CustomMiddleware',
    ...
)
utilities/custom_middleware.py:
from utilities.request import AddRequestDetails

class CustomMiddleware(object):
    """
    Adds user details to request context during request processing, so that they
    show up in the error emails. Add to settings.MIDDLEWARE_CLASSES and keep it
    outermost (i.e. on top if possible). This allows it to catch exceptions in
    other middlewares as well.
    """
    def process_exception(self, request, exception):
        """
        Process the request to add some variables to it.
        """
        # Add other details about the user to the META CGI variables.
        try:
            if request.user.is_authenticated():
                AddRequestDetails(request)
                # Note: view_args, view_func and view_kwargs are not defined in
                # process_exception (they are arguments of process_view), so the
                # three lines below raise NameError and are swallowed by the bare except.
                request.META['AUTH_VIEW_ARGS'] = str(view_args)
                request.META['AUTH_VIEW_CALL'] = str(view_func)
                request.META['AUTH_VIEW_KWARGS'] = str(view_kwargs)
        except:
            pass
utilities/request.py:
def AddRequestDetails(request):
    """
    Adds details about the user to the request, so any traceback will include the
    details. Good for troubleshooting; this will be included in the email sent to admins
    on error.
    """
    if request.user.is_anonymous():
        request.META['AUTH_NAME'] = "Anonymous User"
        request.META['AUTH_USER'] = "Anonymous User"
        request.META['AUTH_USER_EMAIL'] = ""
        request.META['AUTH_USER_ID'] = 0
        request.META['AUTH_USER_IS_ACTIVE'] = False
        request.META['AUTH_USER_IS_SUPERUSER'] = False
        request.META['AUTH_USER_IS_STAFF'] = False
        request.META['AUTH_USER_LAST_LOGIN'] = ""
    else:
        request.META['AUTH_NAME'] = str(request.user.first_name) + " " + str(request.user.last_name)
        request.META['AUTH_USER'] = str(request.user.username)
        request.META['AUTH_USER_EMAIL'] = str(request.user.email)
        request.META['AUTH_USER_ID'] = str(request.user.id)
        request.META['AUTH_USER_IS_ACTIVE'] = str(request.user.is_active)
        request.META['AUTH_USER_IS_SUPERUSER'] = str(request.user.is_superuser)
        request.META['AUTH_USER_IS_STAFF'] = str(request.user.is_staff)
        request.META['AUTH_USER_LAST_LOGIN'] = str(request.user.last_login)
My trivial solution (works in Django 1.5):
settings.py:
MIDDLEWARE_CLASSES = (
    ...
    'utilities.custom_middleware.UserTracebackMiddleware',
    ...
)
custom_middleware.py:
class UserTracebackMiddleware(object):
    """
    Adds user to request context during request processing, so that they
    show up in the error emails.
    """
    def process_exception(self, request, exception):
        if request.user.is_authenticated():
            request.META['AUTH_USER'] = unicode(request.user.username)
        else:
            request.META['AUTH_USER'] = "Anonymous User"
hope it helps
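Both answers use the old MIDDLEWARE_CLASSES style. On Django 1.10+ the same idea can be expressed as new-style middleware listed in the MIDDLEWARE setting; the class below is a sketch added for illustration, not code from the thread (note that is_authenticated is a property, not a method, on recent Django):

class UserTracebackMiddleware:
    """Attach the logged-in username to request.META so it shows up in error e-mails."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        return self.get_response(request)

    def process_exception(self, request, exception):
        if request.user.is_authenticated:
            request.META['AUTH_USER'] = request.user.username
        else:
            request.META['AUTH_USER'] = "Anonymous User"

It goes in MIDDLEWARE in settings.py, as close to the top as possible, for the same reason given in the first answer.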

google apps engine django form pyamf

My Flash + PyAMF + GAE setup works nicely.
Now I would like to create a classic Django form by following the Google tutorial: http://code.google.com/appengine/articles/djangoforms.html
I did, but when I post the data entered in my form I get the following message from PyAMF:
Malformed stream (amfVersion=110)
400 Bad Request. The request body was unable to be successfully decoded.
Traceback:
Traceback (most recent call last):
File "C:\Users\Giil\Documents\dev\gae\moviesbuilder\pyamf\remoting\gateway\google.py", line 79, in post
logger=self.logger, timezone_offset=timezone_offset)
File "C:\Users\Giil\Documents\dev\gae\moviesbuilder\pyamf\remoting\__init__.py", line 640, in decode
msg.amfVersion)
DecodeError: Malformed stream (amfVersion=110)
Now that makes sense to me, because what I send from my form is not AMF. How can I deal with this?
Note: I have the feeling that the problem comes from my app.yaml.
I have no special handler to tell the application to process this form differently...
I solved the problem my own way.
My form (the POST is directed to another handler, not just "/" as in the Google example):
class Projects(webapp.RequestHandler):
    def get(self):
        self.response.out.write('<html><body>'
                                '<form method="POST" '
                                'action="/ProjectsPage">'
                                '<table>')
        self.response.out.write(ProjectForm())
        self.response.out.write('</table>'
                                '<input type="submit">'
                                '</form></body></html>')
And then what I need to write to the datastore and display the list:
class ProjectsPage(webapp.RequestHandler):
    # getting data and saving
    def post(self):
        data = ProjectForm(data=self.request.POST)
        if data.is_valid():
            # Save the data, and redirect to the view page
            entity = data.save(commit=False)
            entity.added_by = users.get_current_user()
            entity.put()
            self.redirect('/projects.html')
        else:
            # Reprint the form
            self.response.out.write('<html><body>'
                                    '<form method="POST" '
                                    'action="/">'
                                    '<table>')
            self.response.out.write(data)
            self.response.out.write('</table>'
                                    '<input type="submit">'
                                    '</form></body></html>')

    # display list of projects
    def get(self):
        query = db.GqlQuery("SELECT * FROM Project WHERE added_by=:1 ORDER BY name", users.get_current_user())
        template_values = {
            'projects': query,
        }
        path = os.path.join(os.path.dirname(__file__), 'templates/projects.html')
        self.response.out.write(template.render(path, template_values))
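For completeness, the two handlers above still have to be wired to their URLs in the webapp application, and the PyAMF gateway should stay on its own path so plain form posts never reach the AMF decoder. The mapping below is a sketch based on the handler names in the answer, meant to sit at the bottom of the same module; the gateway path is an assumption:

from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

application = webapp.WSGIApplication([
    # ('/gateway/', PyAMFGatewayHandler),  # hypothetical: keep the existing PyAMF gateway on its own URL
    ('/projects', Projects),         # renders the empty form
    ('/ProjectsPage', ProjectsPage),  # receives the form POST and lists saved projects
], debug=True)

def main():
    run_wsgi_app(application)

if __name__ == '__main__':
    main()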