I am trying to find a way to scrape and parse multiple pages in a signed-in area.
These are examples of the links, accessible only when signed in, that I want to parse:
#http://example.com/seller/demand/?id=305554
#http://example.com/seller/demand/?id=305553
#http://example.com/seller/demand/?id=305552
#....
I want to create a spider that can open each of these links and parse it.
I have already created a spider that can open and parse only one of them.
When I tried to use a "for" or "while" loop to issue more requests for the other links, it did not work: I cannot put several return statements into a generator, and it raises an error. I also tried link extractors, but they did not work for me.
Here is my code:
#!c:/server/www/scrapy
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from array import *
from stack.items import StackItem
from scrapy.linkextractors import LinkExtractor
class Spider3(Spider):
    name = "Spider3"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/login"]  # this link leads to the login page
When I am signed in, the site returns a page whose URL contains "stat"; that is why the first "if" condition is there.
Once signed in, I request one link and call the function parse_items.
    def parse(self, response):
        # when "stat" is in the URL, it means that I have just signed in
        if "stat" in response.url:
            return Request("http://example.com/seller/demand/?id=305554", callback=self.parse_items)
        else:
            # a successful login redirects me to a page whose URL contains "stat"
            return [FormRequest.from_response(response,
                formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login', 'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'}, callback=self.parse)]
The function parse_items simply parses the desired content from one page:
    def parse_items(self, response):
        questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
        for question in questions:
            item = StackItem()
            item['name'] = question.xpath('th/text()').extract()[0]
            item['value'] = question.xpath('td/text()').extract()[0]
            yield item
Can you please help me update this code so that it opens and parses more than one page per session?
I don't want to sign in again for each request.
The session most likely depends on cookies, and Scrapy manages those by itself. For example:
def parse_items(self, response):
    questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
    for question in questions:
        item = StackItem()
        item['name'] = question.xpath('th/text()').extract()[0]
        item['value'] = question.xpath('td/text()').extract()[0]
        yield item
    next_url = ''  # find the URL of the next page in the current page
    if next_url:
        yield Request(next_url, self.parse_items)
        # Scrapy will retain the session for the next page if it's managed by cookies
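Because a callback can be written as a generator, one possible way to request several known pages after logging in is to yield one Request per URL instead of returning a single one. The sketch below only builds the URLs; the helper name demand_urls and the descending ID range are assumptions taken from the example links in the question, and the Scrapy part is shown as a comment because it needs the spider itself.

```python
def demand_urls(first_id, last_id, base="http://example.com/seller/demand/"):
    """Yield demand-page URLs for IDs counting down from first_id to last_id.

    Hypothetical helper: the name and the descending ID range are
    assumptions based on the example links in the question.
    """
    for demand_id in range(first_id, last_id - 1, -1):
        yield "%s?id=%d" % (base, demand_id)


# Inside the spider's parse() method, the single "return Request(...)"
# could then become a loop of yields. The login branch would need to use
# yield as well, because the method is now a generator:
#
#     for url in demand_urls(305554, 305552):
#         yield Request(url, callback=self.parse_items)

urls = list(demand_urls(305554, 305552))
```

All requests yielded this way go through the same cookie handling, so the login should only be needed once.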
I am currently working on the same problem. I use InitSpider so I can override __init__ and init_request. The first is just for initialising custom state; the actual magic happens in my init_request:
def init_request(self):
    """This function is called before crawling starts."""
    # Do not start a request on error,
    # simply return nothing and quit scrapy
    if self.abort:
        return
    # Do a login
    if self.login_required:
        # Start with login first
        return Request(url=self.login_page, callback=self.login)
    else:
        # Start with the parse function
        return Request(url=self.base_url, callback=self.parse)
My login looks like this:
def login(self, response):
    """Generate a login request."""
    self.log('Login called')
    return FormRequest.from_response(
        response,
        formdata=self.login_data,
        method=self.login_method,
        callback=self.check_login_response
    )
self.login_data is a dict with post values.
I am still a beginner with python and scrapy, so I might be doing it the wrong way. Anyway, so far I have produced a working version that can be viewed on github.
HTH:
https://github.com/cytopia/crawlpy
I want to add a link in an admin form to show the children of the clicked object.
In my model I have:
def change_form_link(self):
    changeform_url = urlresolvers.reverse('admin:customerData_wifirouter_changelist')
    return '<a href="%s" >Change</a>' % changeform_url
change_form_link.allow_tags = True
Everything works fine here, and I get my link to the admin page to change my "wifirouters".
But I need to filter this page by building.
So I tried:
def change_form_link(self):
    changeform_url = urlresolvers.reverse('admin:customerData_wifirouter_changelist', args=[self.building_label,])
    return '<a href="%s" >Change</a>' % changeform_url
change_form_link.allow_tags = True
And I get a bad error:
NoReverseMatch at /admin/customerData/customer/1/
Reverse for 'customerData_wifirouter_changelist' with arguments '('4 rue de Douai',)' and keyword arguments '{}' not found. 1 pattern(s) tried: ['admin/customerData/wifirouter/$']
On the other hand, my admin filtered page works fine at http://127.0.0.1:8000/admin/customerData/wifirouter/?building__building_label=4+rue+de+Douai .
I understand that I am using the wrong syntax to link to the filtered admin page.
Any advice?
The querystring ?building__building_label=4+rue+de+Douai is not part of the url that is reversed. You can reverse the url, then add the querystring to it.
from urllib.parse import urlencode
# In Python 2 from urllib import urlencode
changeform_url = urlresolvers.reverse('admin:customerData_wifirouter_changelist')
querydict = {'building__building_label': self.building_label}
changeform_url = '%s?%s' % (changeform_url, urlencode(querydict))
Using urlencode ensures that the string is url encoded (e.g. the spaces are converted to + signs).
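As a quick check of the encoding step on its own, here is a minimal sketch that does not need Django. The function name filtered_changelist_url is hypothetical, and the changelist path is hard-coded from the working URL in the question rather than obtained from reverse.

```python
from urllib.parse import urlencode


def filtered_changelist_url(changelist_url, building_label):
    """Append the admin filter as an encoded query string (hypothetical helper)."""
    query = urlencode({'building__building_label': building_label})
    return '%s?%s' % (changelist_url, query)


url = filtered_changelist_url('/admin/customerData/wifirouter/', '4 rue de Douai')
print(url)  # /admin/customerData/wifirouter/?building__building_label=4+rue+de+Douai
```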
Here is what I have currently:
urls.py:
...
url(r'this/is/relative', 'myapp.views.callview', name='myapp_callview'),
...
views.py:
def callview(request, **kwargs):
    # I can get the complete URL by doing this
    print(request.build_absolute_uri())  # Prints: https://domain:8080/myapp/this/is/relative
    # How do I get just /myapp/this/is/relative, or even /this/is/relative?
I would like to extract the relative uri from the view. I could just use regex, but I think there is already something out there that would let me do this.
This will give you "/myapp/this/is/relative":
from django.core import urlresolvers
relative_uri = urlresolvers.reverse("myapp_callview")
Link to Django docs page: https://docs.djangoproject.com/en/dev/ref/urlresolvers/
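If the absolute URI is already in hand, the path can also be split off with the standard library instead of a regex. This is only a sketch with a hypothetical helper name, relative_path; Django's request object also exposes the path of the current request directly as request.path.

```python
from urllib.parse import urlsplit


def relative_path(absolute_uri):
    """Return only the path component of an absolute URI (hypothetical helper)."""
    return urlsplit(absolute_uri).path


print(relative_path('https://domain:8080/myapp/this/is/relative'))  # /myapp/this/is/relative
```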
I'm a newbie at Django tests. How do I create a unit test for this view function? Should my unit test import the function from views? Please give an example; it will help me understand how this works.
#maintainance_job
def time_to_end(request):
    today = datetime.date.today()
    datas = Data.objects.filter(start__lte=today,
                                other_date__gte=today)
    for data in datas:
        subject = _(u'Send email')
        body = render_to_string('mail.txt',
                                {'data': data})
        email = EmailMessage(subject, body,
                             'admin@admin.com',
                             [data.user.email])
        email.send()
    return HttpResponse('Done')
urls:
(r'^maintainance/jobs/time_to_end/$', 'content.views.time_to_end'),
Here is the simplest test for your case (place it in tests.py in the directory that contains your view function):
from django.utils import unittest
from django.test.client import Client
class HttpTester( unittest.TestCase ):
    def setUp( self ):
        self._client = Client() # init a client for local access to pages of your site
    def test_time_to_end( self ):
        response = self._client.get( '/maintainance/jobs/time_to_end/' )
        # response = self._client.post( '/maintainance/jobs/time_to_end/' ) - a 'POST' request
        result = response.content  # on Python 3 this is bytes, so compare with b'Done'
        assert result == 'Done'
So, we use self._client to make 'get' and 'post' requests. Responses can be accessed by reading response.content (the full text of the response) or by reading response.context if you use templates and want to access the variables passed to the templates.
For example, if your view normally passes a dict with the context variable 'result' to the template:
{ 'result': "Done" }
then you could check your result:
result = response.context[ 'result' ]
assert result == 'Done'
So you expect the response to contain the 'result' variable with the value 'Done'; otherwise the assert statement raises AssertionError.
If an exception is raised, the test fails, and AssertionError is an exception too.
More details are in the docs and in the book "Dive into Python".
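To see how a failed assert is reported, the following sketch runs two tiny tests against a stand-in function and inspects the result. It uses only the standard unittest module; time_to_end_stub is a hypothetical replacement for the real view, and the test case is defined inside a function only so that importing the snippet does not register it with a test runner.

```python
import unittest


def run_stub_tests():
    """Run two tiny tests against a stand-in view and return the TestResult."""

    def time_to_end_stub():
        # Hypothetical stand-in for the body returned by the real view.
        return 'Done'

    class StubTests(unittest.TestCase):
        def test_expected_body(self):
            # Passes: the stub returns exactly 'Done'.
            self.assertEqual(time_to_end_stub(), 'Done')

        def test_wrong_expectation(self):
            # Fails on purpose: the bare assert raises AssertionError.
            assert time_to_end_stub() == 'Not done'

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(StubTests)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_stub_tests()
print(result.testsRun, len(result.failures), len(result.errors))
```

The AssertionError from the second test is recorded as a failure, while any other exception type would be recorded as an error.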
I have URLs like http://example.com/depict?smiles=CO&width=200&height=200 (and with several other optional arguments)
My urls.py contains:
urlpatterns = patterns('',
    (r'^$', 'cansmi.index'),
    (r'^cansmi$', 'cansmi.cansmi'),
    url(r'^depict$', cyclops.django.depict, name="cyclops-depict"),
)
I can go to that URL and get the 200x200 PNG that was constructed, so I know that part works.
In my template from the "cansmi.cansmi" response I want to construct a URL for the named URL pattern "cyclops-depict" given some query parameters. I thought I could do
{% url cyclops-depict smiles=input_smiles width=200 height=200 %}
where "input_smiles" is an input to the template via a form submission. In this case it's the string "CO" and I thought it would create a URL like the one at top.
This template fails with a TemplateSyntaxError:
Caught an exception while rendering: Reverse for 'cyclops-depict' with arguments '()' and keyword arguments '{'smiles': u'CO', 'height': 200, 'width': 200}' not found.
This is a rather common error message both here on StackOverflow and elsewhere. In every case I found, people were using them with parameters in the URL path regexp, which is not the case I have where the parameters go into the query.
That means I'm doing it wrong. How do I do it right? That is, I want to construct the full URL, including path and query parameters, using something in the template.
For reference,
% python manage.py shell
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from django.core.urlresolvers import reverse
>>> reverse("cyclops-depict", kwargs=dict())
'/depict'
>>> reverse("cyclops-depict", kwargs=dict(smiles="CO"))
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/Library/Python/2.6/site-packages/django/core/urlresolvers.py", line 356, in reverse
*args, **kwargs)))
File "/Library/Python/2.6/site-packages/django/core/urlresolvers.py", line 302, in reverse
"arguments '%s' not found." % (lookup_view_s, args, kwargs))
NoReverseMatch: Reverse for 'cyclops-depict' with arguments '()' and keyword arguments '{'smiles': 'CO'}' not found.
Building a URL with a query string by string concatenation, as suggested by some answers, is as bad an idea as building SQL queries by string concatenation. It is complicated, inelegant, and especially dangerous with user-provided (untrusted) input. Unfortunately, Django does not offer an easy way to pass query parameters to the reverse function.
Python's standard urllib, however, provides the desired query string encoding functionality.
In my application I've created a helper function:
import urllib
def url_with_querystring(path, **kwargs):
    return path + '?' + urllib.urlencode(kwargs)  # for Python 3, use urllib.parse.urlencode instead
Then I call it in the view as follows:
quick_add_order_url = url_with_querystring(reverse(order_add),
responsible=employee.id, scheduled_for=datetime.date.today(),
subject='hello world!')
# http://localhost/myapp/order/add/?responsible=5&
# scheduled_for=2011-03-17&subject=hello+world%21
Please note the proper encoding of special characters like space and exclamation mark!
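To illustrate why unencoded concatenation is risky with untrusted input, here is a small standard-library sketch for Python 3. The input value is a made-up example: a string containing '&' and '=' that silently turns into an extra parameter when concatenated, but stays a single value when encoded.

```python
from urllib.parse import parse_qsl, urlencode

untrusted = 'C&admin=1'  # hypothetical user input containing '&' and '='

# Naive concatenation lets the input introduce a second parameter.
naive_query = 'smiles=' + untrusted
# urlencode escapes '&' and '=' so the value stays intact.
encoded_query = urlencode({'smiles': untrusted})

print(parse_qsl(naive_query))    # [('smiles', 'C'), ('admin', '1')]
print(parse_qsl(encoded_query))  # [('smiles', 'C&admin=1')]
```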
Your regular expression has no placeholders (that's why you are getting NoReverseMatch):
url(r'^depict$', cyclops.django.depict, name="cyclops-depict"),
You could do it like this:
{% url cyclops-depict %}?smiles=CO&width=200&height=200
The URLconf lookup does not include GET or POST parameters.
Or, if you wish to use the {% url %} tag, you should restructure your URL pattern to something like
r'^depict/(?P<width>\d+)/(?P<height>\d+)/(?P<smiles>\w+)$'
then you could do something like
{% url cyclops-depict 200 200 "CO" %}
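The restructured pattern can be tried out with the plain re module before wiring it into urls.py. This is only a sketch of how the named groups capture the values; it also shows one limitation worth knowing: \w+ does not match SMILES strings that contain punctuation such as '='.

```python
import re

# Same regular expression as the suggested URL pattern.
DEPICT_PATTERN = re.compile(r'^depict/(?P<width>\d+)/(?P<height>\d+)/(?P<smiles>\w+)$')

match = DEPICT_PATTERN.match('depict/200/200/CO')
print(match.groupdict())  # {'width': '200', 'height': '200', 'smiles': 'CO'}

# A SMILES string with a double bond contains '=', which \w does not match.
print(DEPICT_PATTERN.match('depict/200/200/C=O'))  # None
```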
Follow-up:
Simple example for custom tag:
from django.core.urlresolvers import reverse
from django import template
register = template.Library()
@register.tag(name="myurl")
def myurl(parser, token):
    tokens = token.split_contents()
    return MyUrlNode(tokens[1:])
class MyUrlNode(template.Node):
    def __init__(self, tokens):
        self.tokens = tokens
    def render(self, context):
        url = reverse('cyclops-depict')
        qs = '&'.join([t for t in self.tokens])
        return '?'.join((url, qs))
You could use this tag in your templates like so:
{% myurl width=200 height=200 name=SomeName %}
and hopefully it should output something like
/depict?width=200&height=200&name=SomeName
I recommend using Django's built-in QueryDict. It also handles lists properly, and it automatically escapes some special characters (like =, ?, /, '#'):
from django.http import QueryDict
from django.core.urlresolvers import reverse
q = QueryDict('', mutable=True)
q['some_key'] = 'some_value'
q.setlist('some_list', [1,2,3])
'%s?%s' % (reverse('some_view_name'), q.urlencode())
# '/some_url/?some_list=1&some_list=2&some_list=3&some_key=some_value'
q.appendlist('some_list', 4)
q['value_with_special_chars'] = 'hello=w#rld?'
'%s?%s' % (reverse('some_view_name'), q.urlencode())
# '/some_url/?value_with_special_chars=hello%3Dw%23rld%3F&some_list=1&some_list=2&some_list=3&some_list=4&some_key=some_value'
To use this in templates, you will need to create a custom template tag.
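Outside Django, a comparable result for list values can be sketched with the standard library alone: urlencode accepts doseq=True to expand sequences into repeated keys. This is an analogue for illustration, not QueryDict itself, and it assumes Python 3.

```python
from urllib.parse import urlencode

params = {'some_list': [1, 2, 3], 'some_key': 'some_value'}

# doseq=True emits one key=value pair per list element.
list_query = urlencode(params, doseq=True)
print(list_query)  # some_list=1&some_list=2&some_list=3&some_key=some_value

# Special characters in values are percent-encoded.
special_query = urlencode({'value_with_special_chars': 'hello=w#rld?'})
print(special_query)  # value_with_special_chars=hello%3Dw%23rld%3F
```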
A working variation of the previous answers, based on my experience dealing with this:
from django.urls import reverse
from django.utils.http import urlencode
def build_url(*args, **kwargs):
    params = kwargs.pop('params', {})
    url = reverse(*args, **kwargs)
    if params:
        url += '?' + urlencode(params)
    return url
How to use:
>>> build_url('products-detail', kwargs={'pk': 1}, params={'category_id': 2})
'/api/v1/shop/products/1/?category_id=2'
The answer that used urllib is indeed good; however, while it was trying to avoid string concatenation, it still used it in path + '?' + urllib.urlencode(kwargs). I believe this may cause issues when the path already has some query params.
A modified function would look like:
import urllib
import urlparse  # Python 2; in Python 3 these functions live in urllib.parse
def url_with_querystring(url, **kwargs):
    url_parts = list(urlparse.urlparse(url))
    query = dict(urlparse.parse_qsl(url_parts[4]))
    query.update(kwargs)
    url_parts[4] = urllib.urlencode(query)
    return urlparse.urlunparse(url_parts)
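For Python 3, the same idea might look like the sketch below, where everything comes from urllib.parse. The example path '/depict?smiles=CO' is borrowed from the question purely for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def url_with_querystring(url, **kwargs):
    """Merge kwargs into the query string of url, keeping existing params."""
    url_parts = list(urlparse(url))
    # Index 4 of the parsed tuple is the query component.
    query = dict(parse_qsl(url_parts[4]))
    query.update(kwargs)
    url_parts[4] = urlencode(query)
    return urlunparse(url_parts)


merged = url_with_querystring('/depict?smiles=CO', width=200, height=200)
print(merged)  # /depict?smiles=CO&width=200&height=200
```

Note that going through dict() keeps only the last value of a repeated key, so this variant is not suitable for URLs that rely on repeated parameters.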
Neither of the original answers addresses the related issue of resolving URLs in view code. For future searchers: if you are trying to do this, use kwargs, something like:
reverse('myviewname', kwargs={'pk': value})