Using Django 1.7 and Python 2.7, in a view I have:
page = 0
sex = [u'\u0632\u0646'] #sex = زن
url = "/result/%s/%d" % (sex, page)
return HttpResponseRedirect(url)
Which needs to return:
/result/زن/0
However the resulting url turns out to be:
/result/[u'\u0632\u0646']/0
Which does not match what I envisage in the pattern:
`url(r'^result/(?P<sex>\w+)/(?P<page>\d+)','userprofile.views.profile_search_result')`,
I also tried
return HttpResponseRedirect(iri_to_uri(url))
but that does not solve the problem.
I'm really confused and would appreciate your help fixing this.
Since sex is a list, you simply need to use the actual element you want:
url = "/result/%s/%d" % (sex[0], page)
Although note that to construct URLs in Django, you should really use the reverse function:
from django.core.urlresolvers import reverse
...
url = reverse('userprofile.views.profile_search_result', kwargs={'sex': sex[0], 'page': page})
url should also be a unicode string for that to work:
page = 0
sex = u'\u0632\u0646' #sex=زن
url = u"/result/%s/%d" % (sex, page)
return HttpResponseRedirect(url)
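Putting the two fixes together, a minimal sketch of the corrected view (the enclosing view function name is hypothetical; it assumes the URL pattern quoted above):

from django.core.urlresolvers import reverse  # Django 1.7 location
from django.http import HttpResponseRedirect

def profile_search(request):  # hypothetical name for the view doing the redirect
    page = 0
    sex = [u'\u0632\u0646']  # sex = زن
    # Pass the list element, not the list; reverse() builds the path for you
    url = reverse('userprofile.views.profile_search_result',
                  kwargs={'sex': sex[0], 'page': page})
    return HttpResponseRedirect(url)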
I am scraping the website "www.accell-group.com" using the "scrapy" library for Python. The site is scraped completely: in total, 131 pages (text/html) and 2 documents (application/pdf) are identified. Scrapy did not throw any warnings or errors. My algorithm is supposed to scrape every single link. I use CrawlSpider.
However, when I look at the page "http://www.accell-group.com/nl/investor-relations/jaarverslagen/jaarverslagen-van-accell-group.htm", which is reported by "scrapy" as scraped/processed, I see that there are more PDF documents on it, for example "http://www.accell-group.com/files/4/5/0/1/Jaarverslag2014.pdf". I cannot find any reason for them not to be scraped. There is no dynamic/JavaScript content on this page, and it is not forbidden in "http://www.accell-group.com/robots.txt".
Do you have any idea why this might happen?
Could it be because the "files" folder is not in "http://www.accell-group.com/sitemap.xml"?
Thanks in advance!
My code:
class PyscrappSpider(CrawlSpider):
    """This is the Pyscrapp spider"""
    name = "PyscrappSpider"

    def __init__(self, *a, **kw):
        # Get the passed URL
        originalURL = kw.get('originalURL')
        logger.debug('Original url = {}'.format(originalURL))
        # Add a protocol, if needed
        startURL = 'http://{}/'.format(originalURL)
        self.start_urls = [startURL]
        self.in_redirect = {}
        self.allowed_domains = [urlparse(i).hostname.strip() for i in self.start_urls]
        self.pattern = r""
        self.rules = (Rule(LinkExtractor(deny=[r"accessdenied"]), callback="parse_data", follow=True), )
        # Get WARC writer
        self.warcHandler = kw.get('warcHandler')
        # Initialise the base constructor
        super(PyscrappSpider, self).__init__(*a, **kw)

    def parse_start_url(self, response):
        if response.request.meta.has_key("redirect_urls"):
            original_url = response.request.meta["redirect_urls"][0]
            if (not self.in_redirect.has_key(original_url)) or (not self.in_redirect[original_url]):
                self.in_redirect[original_url] = True
                self.allowed_domains.append(original_url)
        return self.parse_data(response)

    def parse_data(self, response):
        """This function extracts data from the page."""
        self.warcHandler.write_response(response)
        pattern = self.pattern
        # Check if we are interested in the current page
        if (not response.request.headers.get('Referer')
                or re.search(pattern, self.ensure_not_null(response.meta.get('link_text')), re.IGNORECASE)
                or re.search(r"/(" + pattern + r")", self.ensure_not_null(response.url), re.IGNORECASE)):
            logging.debug("This page gets processed = %(url)s", {'url': response.url})
            sel = Selector(response)
            item = PyscrappItem()
            item['url'] = response.url
            return item
        else:
            logging.warning("This page does NOT get processed = %(url)s", {'url': response.url})
            return response.request
Remove or expand your allowed_domains variable appropriately and you should be fine. By default, all the URLs the spider follows are restricted by allowed_domains.
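For instance, a minimal sketch (the exact domains to allow are an assumption; adjust them to wherever your crawl gets redirected):

# List every domain the spider is allowed to follow; leaving
# allowed_domains unset removes the restriction entirely.
self.allowed_domains = ['accell-group.com', 'www.accell-group.com']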
EDIT: This case specifically concerns PDFs. PDFs are excluded as link targets by default: LinkExtractor's deny_extensions argument defaults to IGNORED_EXTENSIONS, which includes pdf.
To allow your spider to crawl PDFs, all you have to do is exclude them from IGNORED_EXTENSIONS by setting deny_extensions explicitly:
from scrapy.linkextractors import IGNORED_EXTENSIONS

self.rules = (
    Rule(
        LinkExtractor(
            deny=[r"accessdenied"],
            # Keep the default ignore list, but let .pdf links through
            deny_extensions=set(IGNORED_EXTENSIONS) - set(['pdf']),
        ),
        callback="parse_data",
        follow=True,
    ),
)
So, I'm afraid, this is the answer to the question "Why does Scrapy miss some links?". As you will likely see, it just opens the door to further questions, like "how do I handle those PDFs?", but I guess that is the subject of another question.
Alright, so I'm working on a Scrapy-based web crawler with some simple functionality. The bot is supposed to go from page to page, parsing and then downloading. I've gotten the parser to work, and I've gotten the downloading to work; I can't get the crawling to work.

I've read the documentation on the Spider class, I've read the documentation on how parse is supposed to work, and I've tried returning vs. yielding, and I'm still nowhere. What seems to happen, from a debug script I wrote, is the following: the code will run, it will grab page 1 just fine, it'll get the link to page two, it'll go to page two, and then it will happily stay on page two, not grabbing page three at all.

I don't know where the mistake in my code is, or how to alter it to fix it. Any help would be appreciated. I'm sure the mistake is basic, but I can't figure out what's going on.
import scrapy

class ParadiseSpider(scrapy.Spider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]

    def __init__(self):
        self.found = 0
        self.goto = "no"

    def parse(self, response):
        urlthing = response.xpath("//a[@title='Next page']").extract()
        urlthing = urlthing.pop()
        newurl = urlthing.split()
        print newurl
        url = newurl[1]
        url = url.replace("href=", "")
        url = url.replace('"', "")
        url = "http://forums.somethingawful.com/" + url
        print url
        self.goto = url
        return scrapy.Request(self.goto, callback=self.parse_save, dont_filter=True)

    def parse_save(self, response):
        nfound = str(self.found)
        print "Testing" + nfound
        self.found = self.found + 1
        return scrapy.Request(self.goto, callback=self.parse, dont_filter=True)
Use Scrapy's rule engine so that you don't have to write the next-page crawling code in your callback: pass the XPath for the next-page link in restrict_xpaths, and the callback will receive the response of each crawled page. Note that this requires a CrawlSpider, and that a CrawlSpider's callback must not be named parse, because CrawlSpider uses that method internally:

rules = (Rule(LinkExtractor(restrict_xpaths=['//a[contains(text(), "Next")]']), callback='parse_item', follow=True),)

def parse_item(self, response):
    print response.url
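For completeness, here is a sketch of how the whole spider from the question might look with this approach (assuming Scrapy 1.0+ import paths; the XPath reuses the "Next page" title from the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ParadiseSpider(CrawlSpider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]
    # Follow every "Next page" link; parse_item runs once per fetched page.
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths=['//a[@title="Next page"]']),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        print response.url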
From my template:
<a href="{% url 'othermonth' 2013 7 %}">July</a>
URL Pattern:
url(r'^ocal/$', views.calendar, name = "othermonth"),
View:
def calendar(request, year, month):
    my_year = int(year)
    my_month = int(month)
    my_calendar_from_month = datetime(my_year, my_month, 1)
    my_calendar_to_month = datetime(my_year, my_month, monthrange(my_year, my_month)[1])
    my_tickets = Event.objects.filter(on_sale__gte=my_calendar_from_month).filter(on_sale__lte=my_calendar_to_month)
    my_previous_year = my_year
    my_previous_month = my_month - 1
    if my_previous_month == 0:
        my_previous_year = my_year - 1
        my_previous_month = 12
    my_next_year = my_year
    my_next_month = my_month + 1
    if my_next_month == 13:
        my_next_year = my_year + 1
        my_next_month = 1
    my_year_after_this = my_year + 1
    my_year_before_this = my_year - 1
    cal = TicketCalendar(my_tickets).formatmonth(year, month)
    return render_to_response('calendar.html', {
        'events_list': my_tickets,
        'calendar': mark_safe(cal),
        'month': my_month,
        'month_name': named_month(my_month),
        'year': my_year,
        'previous_month': my_previous_month,
        'previous_month_name': named_month(my_previous_month),
        'previous_year': my_previous_year,
        'next_month': my_next_month,
        'next_month_name': named_month(my_next_month),
        'next_year': my_next_year,
        'year_before_this': my_year_before_this,
        'year_after_this': my_year_after_this,
    }, context_instance=RequestContext(request))
Error:
Reverse for 'othermonth' with arguments '(2013, 7)' and keyword arguments '{}' not found.
I've searched through Stack Overflow and the Django documentation, but I can't seem to figure out why I'm getting this NoReverseMatch error. I'm sure it's a very simple oversight on my part, because I'm staring at code from a previous project which is almost identical to this, and that works fine. Any help would be appreciated, thanks.
UPDATE: I tried removing the parameters that I was trying to send with the URL and that fixed the NoReverseMatch however the view that is called requires those arguments so the link fails.
How do you plan for those arguments to be embedded in your URL? There's nothing capturing them, and no way for the reverse lookup to construct it. You need a URL pattern that accepts these parameters. Something like:
url(r'^ocal/(?P<year>\d{4})/(?P<month>(0|1)?\d)/$', views.calendar, name = "othermonth_bymonth"),
Using keyword rather than positional arguments here is optional, but I think it makes things easier - and allows setting default values that can trigger behavior like showing a calendar for the current day when nothing's specified.
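With that pattern in place, the failing link from the template could be written like this (a sketch using the new pattern name):

<a href="{% url 'othermonth_bymonth' year=2013 month=7 %}">July</a>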
Also, your my_tickets queryset can be constructed thus:
my_tickets = Event.objects.filter(on_sale__year=year, on_sale__month=month)
Which I think is a little cleaner to read.
Actually, upon a closer look - Django's built in date based views can do a lot of this for you. If you haven't already looked into them, I recommend doing so:
https://docs.djangoproject.com/en/dev/ref/class-based-views/generic-date-based/
For this particular view, all you'd need to do is subclass MonthArchiveView to create your TicketCalendar instance and add it to your context.
Edit: OK, you're still getting problems. This is how I would go about solving this:
views.py
class TicketMonthArchiveView(MonthArchiveView):
    allow_empty = True   # show months even in which no tickets exist
    allow_future = True  # show future months
    model = Event

    def get_context_data(self, **kwargs):
        base_context = super(TicketMonthArchiveView, self).get_context_data(**kwargs)
        my_tickets = kwargs['object_list']
        base_context['calendar'] = mark_safe(TicketCalendar(my_tickets).formatmonth(self.get_year(), self.get_month()))
        # the above could be a little off because I don't know exactly how your formatmonth method works
        return base_context
urls.py
from somewhere.views import TicketMonthArchiveView
# other patterns, plus:
url(r'^ocal/(?P<year>\d{4})/(?P<month>(0|1)?\d)/$', TicketMonthArchiveView.as_view(), name='calendar_by_month'),
template event_archive_month.html
{{ next_month|date:'F' }}
Obviously there's a lot more you could do here with the year and day views, but this should demonstrate the general concepts.
Large context volumes caused this behavior for me as well, despite having the correct syntax.
I had this, on Django 1.5:
urls.py
url(r'(?P<CLASSID>[A-Z0-9_]+)/$', 'psclassdefn.views.psclassdefn_detail', name="psclassdefn_detail"),
template.html
{% for inst in li_inst %}
<li><a href={% url 'psclassdefn.views.psclassdefn_detail' CLASSID=inst.CLASSID %}></a></li>
{% endfor %}
I kept on getting NoReverseMatch even though the syntax seemed OK and I could reverse into a no-argument url.
Now, li_inst was a huge list of about 1000 rows fetched from the db and passed to the context variable. Just to test, I trimmed the list by removing all but 10 or so rows. And it worked, without any syntax changes.
Maybe the templating system intentionally throttles the context size? I can filter the list if needed, just didn't expect this error to come from it.
I need to get Django to send an email which contains a URL like this:
http://www.mysite.org/history/
Where 'history' is obtained like so:
history_url = urlresolvers.reverse('satchmo_order_history')
history_url is a parameter that I pass on to the function that sends the email, and it correctly produces '/history/'. But how do I get the first part? (http://www.mysite.org)
Edit 1
Is there anything wrong or unportable about doing it like this?
history = urlresolvers.reverse('satchmo_order_history')
domain = Site.objects.get_current().domain
history_url = 'http://' + domain + history
If you have access to an HttpRequest instance, you can use HttpRequest.build_absolute_uri(location):
absolute_uri = request.build_absolute_uri(relative_uri)
Alternatively, you can get it using the sites framework:
import urlparse
from django.contrib.sites.models import Site
domain = Site.objects.get_current().domain
absolute_uri = urlparse.urljoin('http://{}'.format(domain), relative_uri)
Re: Edit 1
I tend to use urlparse.urljoin, because it's in the standard library and it's technically the most Pythonic way to combine URIs, but I think that your approach is fine too.
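For the email case in the question, that combination might look like this (a sketch; the helper function name is hypothetical):

import urlparse
from django.contrib.sites.models import Site
from django.core import urlresolvers

def build_history_url():  # hypothetical helper
    # reverse() gives the path ('/history/'); the sites framework gives the host
    history = urlresolvers.reverse('satchmo_order_history')
    domain = Site.objects.get_current().domain
    return urlparse.urljoin('http://{}'.format(domain), history)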
Here is my url :
url(r'^test/$|test/(\d+)', views.test_page)
So with the Django runserver started I can enter 127.0.0.1:8000/test/ or the same URL followed by a "page" number.
Here is a simplified version of my view:
def test_page(request, pagenumber):
    paginator = Paginator(Test.objects.all(), 5)
    page = 1
    if pagenumber:
        page = pagenumber
    posts = paginator.page(page)
That works but is kinda inefficient, so I modified it to:
def test_page(request, page=1):
    paginator = Paginator(Test.objects.all(), 5)
    posts = paginator.page(page)
Which is nicer and works when I specify a page number in the URL, but when I just enter 127.0.0.1:8000/test/ it doesn't. I get:
Exception Type: TypeError
Exception Value: int() argument must be a string or a number, not 'NoneType'
Exception Location: /usr/lib/python2.7/site-packages/django/core/paginator.py in validate_number, line 23
Why doesn't the parameter page take the default value 1 when I don't specify any page number?
For things like page numbers, it's better to use GET parameters, i.e. the form /test/?page=1. You read the value directly in the view via request.GET.get('page'), so the urlconf is just r'^test/$'.
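A minimal sketch of that approach, reusing the Test model and page size from the question:

def test_page(request):
    paginator = Paginator(Test.objects.all(), 5)
    # Fall back to page 1 when no ?page= parameter is given
    page = request.GET.get('page', 1)
    posts = paginator.page(page)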
Check in the code whether the function parameter page is None; if it is, set the value to 1, instead of relying on the function default.
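A sketch of that workaround: when /test/ matches, the non-participating (\d+) group is passed to the view as None, which overrides any default in the signature, so the view has to reset it by hand:

def test_page(request, page=None):
    # Django passed page=None because the (\d+) group matched nothing
    if page is None:
        page = 1
    paginator = Paginator(Test.objects.all(), 5)
    posts = paginator.page(page)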
I have never seen a url entry written that way. To pass a named argument, I think you have to specify that argument's name in your urls.py; it doesn't look like you're doing that now. Could you try something like:
url(r'^test/(?P<page>\d+)/$', views.test_page),
This is the way Django recommends passing default arguments: https://docs.djangoproject.com/en/dev/topics/http/urls/#notes-on-capturing-text-in-urls
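Concretely, the approach those docs describe is two URL entries pointing at the same view, with the default supplied in the view signature; a minimal sketch:

url(r'^test/$', views.test_page),                # nothing captured, view default applies
url(r'^test/(?P<page>\d+)/$', views.test_page),  # explicit page captured

def test_page(request, page=1):
    paginator = Paginator(Test.objects.all(), 5)
    posts = paginator.page(page)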
I would try this:
def test_page(request, **kwargs):
page = kwargs.get('pagenumber', 1)
paginator = Paginator(Test.objects.all(), 5)
posts = paginator.page(page)
No shell around so I couldn't test this code.
Please do not try to assign a default value to the argument "page" that you are passing in the function call. Just keep it as:
def test_page(request, page):
and write the url as:
url(r'^test/(?P<page>\d+)/$', views.test_page)