Reset cookies in Scrapy without disabling them - cookies

I use CrawlSpider to crawl a website. The website detects my spider using cookies. If I disable them, it also detects that I am a robot. So how to use new cookies in each request.
My spider is pretty simple :
# -*- coding: utf-8 -*-
import scrapy
import requests
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class ExampleSpider(CrawlSpider):
name = 'example'
allowed_domains = ['www.example.com']
start_urls = ['http://www.example.com/items']
rules = (
Rule(LinkExtractor(allow=('/items/.'),deny=('sendMessage')), follow=True),
Rule(LinkExtractor(allow=('/item/[a-z\+]+\-[0-9]+') ,deny=('sendMessage')), callback='parse_item', follow=False),
)
def parse_item(self, response):
#parsing the page et yielding data
PS: I'm using tor to change ip every x seconds.

You can set cookies for each Request via cookies parameter. It's a bit more complicated when using CrawlSpider class, because it generates requests for you based on rules. However, you can add process_request parameter to your Rule. From the documentation:
process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
Hence, implement that method and add cookies parameter to each request passed in and then return the modified request.

Resolved !
I use the folowing code :
def newsession(self, request):
session_id = random.randint(0,900000)
tagged = request
tagged.meta.update(cookiejar=session_id)
return tagged
In Rule I call newsession function through process_request (thanks Tomáš) :
Rule(LinkExtractor(allow=('/item/[a-z\+]+\-[0-9]+') ,deny=('sendMessage')), process_request='newsession', callback='parse_item', follow=False),

Related

Django cache_page with keys [duplicate]

The #cache_page decorator is awesome. But for my blog I would like to keep a page in cache until someone comments on a post. This sounds like a great idea as people rarely comment so keeping the pages in memcached while nobody comments would be great. I'm thinking that someone must have had this problem before? And this is different than caching per url.
So a solution I'm thinking of is:
#cache_page( 60 * 15, "blog" );
def blog( request ) ...
And then I'd keep a list of all cache keys used for the blog view and then have way of expire the "blog" cache space. But I'm not super experienced with Django so I'm wondering if someone knows a better way of doing this?
This solution works for django versions before 1.7
Here's a solution I wrote to do just what you're talking about on some of my own projects:
def expire_view_cache(view_name, args=[], namespace=None, key_prefix=None):
"""
This function allows you to invalidate any view-level cache.
view_name: view function you wish to invalidate or it's named url pattern
args: any arguments passed to the view function
namepace: optioal, if an application namespace is needed
key prefix: for the #cache_page decorator for the function (if any)
"""
from django.core.urlresolvers import reverse
from django.http import HttpRequest
from django.utils.cache import get_cache_key
from django.core.cache import cache
# create a fake request object
request = HttpRequest()
# Loookup the request path:
if namespace:
view_name = namespace + ":" + view_name
request.path = reverse(view_name, args=args)
# get cache key, expire if the cached item exists:
key = get_cache_key(request, key_prefix=key_prefix)
if key:
if cache.get(key):
# Delete the cache entry.
#
# Note that there is a possible race condition here, as another
# process / thread may have refreshed the cache between
# the call to cache.get() above, and the cache.set(key, None)
# below. This may lead to unexpected performance problems under
# severe load.
cache.set(key, None, 0)
return True
return False
Django keys these caches of the view request, so what this does is creates a fake request object for the cached view, uses that to fetch the cache key, then expires it.
To use it in the way you're talking about, try something like:
from django.db.models.signals import post_save
from blog.models import Entry
def invalidate_blog_index(sender, **kwargs):
expire_view_cache("blog")
post_save.connect(invalidate_portfolio_index, sender=Entry)
So basically, when ever a blog Entry object is saved, invalidate_blog_index is called and the cached view is expired. NB: haven't tested this extensively, but it's worked fine for me so far.
The cache_page decorator will use CacheMiddleware in the end which will generate a cache key based on the request (look at django.utils.cache.get_cache_key) and the key_prefix ("blog" in your case). Note that "blog" is only a prefix, not the whole cache key.
You can get notified via django's post_save signal when a comment is saved, then you can try to build the cache key for the appropriate page(s) and finally say cache.delete(key).
However this requires the cache_key, which is constructed with the request for the previously cached view. This request object is not available when a comment is saved. You could construct the cache key without the proper request object, but this construction happens in a function marked as private (_generate_cache_header_key), so you are not supposed to use this function directly. However, you could build an object that has a path attribute that is the same as for the original cached view and Django wouldn't notice, but I don't recommend that.
The cache_page decorator abstracts caching quite a bit for you and makes it hard to delete a certain cache object directly. You could make up your own keys and handle them in the same way, but this requires some more programming and is not as abstract as the cache_page decorator.
You will also have to delete multiple cache objects when your comments are displayed in multiple views (i.e. index page with comment counts and individual blog entry pages).
To sum up: Django does time based expiration of cache keys for you, but custom deletion of cache keys at the right time is more tricky.
I wrote Django-groupcache for this kind of situations (you can download the code here). In your case, you could write:
from groupcache.decorators import cache_tagged_page
#cache_tagged_page("blog", 60 * 15)
def blog(request):
...
From there, you could simply do later on:
from groupcache.utils import uncache_from_tag
# Uncache all view responses tagged as "blog"
uncache_from_tag("blog")
Have a look at cache_page_against_model() as well: it's slightly more involved, but it will allow you to uncache responses automatically based on model entity changes.
With the latest version of Django(>=2.0) what you are looking for is very easy to implement:
from django.utils.cache import learn_cache_key
from django.core.cache import cache
from django.views.decorators.cache import cache_page
keys = set()
#cache_page( 60 * 15, "blog" );
def blog( request ):
response = render(request, 'template')
keys.add(learn_cache_key(request, response)
return response
def invalidate_cache()
cache.delete_many(keys)
You can register the invalidate_cache as a callback when someone updates a post in the blog via a pre_save signal.
This won't work on django 1.7; as you can see here https://docs.djangoproject.com/en/dev/releases/1.7/#cache-keys-are-now-generated-from-the-request-s-absolute-url the new cache keys are generated with the full URL, so a path-only fake request won't work. You must setup properly request host value.
fake_meta = {'HTTP_HOST':'myhost',}
request.META = fake_meta
If you have multiple domains working with the same views, you should cycle them in the HTTP_HOST, get proper key and do the clean for each one.
Django view cache invalidation for v1.7 and above. Tested on Django 1.9.
def invalidate_cache(path=''):
''' this function uses Django's caching function get_cache_key(). Since 1.7,
Django has used more variables from the request object (scheme, host,
path, and query string) in order to create the MD5 hashed part of the
cache_key. Additionally, Django will use your server's timezone and
language as properties as well. If internationalization is important to
your application, you will most likely need to adapt this function to
handle that appropriately.
'''
from django.core.cache import cache
from django.http import HttpRequest
from django.utils.cache import get_cache_key
# Bootstrap request:
# request.path should point to the view endpoint you want to invalidate
# request.META must include the correct SERVER_NAME and SERVER_PORT as django uses these in order
# to build a MD5 hashed value for the cache_key. Similarly, we need to artificially set the
# language code on the request to 'en-us' to match the initial creation of the cache_key.
# YMMV regarding the language code.
request = HttpRequest()
request.META = {'SERVER_NAME':'localhost','SERVER_PORT':8000}
request.LANGUAGE_CODE = 'en-us'
request.path = path
try:
cache_key = get_cache_key(request)
if cache_key :
if cache.has_key(cache_key):
cache.delete(cache_key)
return (True, 'successfully invalidated')
else:
return (False, 'cache_key does not exist in cache')
else:
raise ValueError('failed to create cache_key')
except (ValueError, Exception) as e:
return (False, e)
Usage:
status, message = invalidate_cache(path='/api/v1/blog/')
I had same problem and I didn't want to mess with HTTP_HOST, so I created my own cache_page decorator:
from django.core.cache import cache
def simple_cache_page(cache_timeout):
"""
Decorator for views that tries getting the page from the cache and
populates the cache if the page isn't in the cache yet.
The cache is keyed by view name and arguments.
"""
def _dec(func):
def _new_func(*args, **kwargs):
key = func.__name__
if kwargs:
key += ':' + ':'.join([kwargs[key] for key in kwargs])
response = cache.get(key)
if not response:
response = func(*args, **kwargs)
cache.set(key, response, cache_timeout)
return response
return _new_func
return _dec
To expired page cache just need to call:
cache.set('map_view:' + self.slug, None, 0)
where self.slug - param from urls.py
url(r'^map/(?P<slug>.+)$', simple_cache_page(60 * 60 * 24)(map_view), name='map'),
Django 1.11, Python 3.4.3
FWIW I had to modify mazelife's solution to get it working:
def expire_view_cache(view_name, args=[], namespace=None, key_prefix=None, method="GET"):
"""
This function allows you to invalidate any view-level cache.
view_name: view function you wish to invalidate or it's named url pattern
args: any arguments passed to the view function
namepace: optioal, if an application namespace is needed
key prefix: for the #cache_page decorator for the function (if any)
from: http://stackoverflow.com/questions/2268417/expire-a-view-cache-in-django
added: method to request to get the key generating properly
"""
from django.core.urlresolvers import reverse
from django.http import HttpRequest
from django.utils.cache import get_cache_key
from django.core.cache import cache
# create a fake request object
request = HttpRequest()
request.method = method
# Loookup the request path:
if namespace:
view_name = namespace + ":" + view_name
request.path = reverse(view_name, args=args)
# get cache key, expire if the cached item exists:
key = get_cache_key(request, key_prefix=key_prefix)
if key:
if cache.get(key):
cache.set(key, None, 0)
return True
return False
Instead of using the cache page decorator, you could manually cache the blog post object (or similar) if there are no comments, and then when there's a first comment, re-cache the blog post object so that it's up to date (assuming the object has attributes that reference any comments), but then just let that cached data for the commented blog post expire and then no bother re-cacheing...
Instead of explicit cache expiration you could probably use new "key_prefix" every time somebody comment the post. E.g. it might be datetime of the last post's comment (you could even combine this value with the Last-Modified header).
Unfortunately Django (including cache_page()) does not support dynamic "key_prefix"es (checked on Django 1.9) but there is workaround exists. You can implement your own cache_page() which may use extended CacheMiddleware with dynamic "key_prefix" support included. For example:
from django.middleware.cache import CacheMiddleware
from django.utils.decorators import decorator_from_middleware_with_args
def extended_cache_page(cache_timeout, key_prefix=None, cache=None):
return decorator_from_middleware_with_args(ExtendedCacheMiddleware)(
cache_timeout=cache_timeout,
cache_alias=cache,
key_prefix=key_prefix,
)
class ExtendedCacheMiddleware(CacheMiddleware):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
if callable(self.key_prefix):
self.key_function = self.key_prefix
def key_function(self, request, *args, **kwargs):
return self.key_prefix
def get_key_prefix(self, request):
return self.key_function(
request,
*request.resolver_match.args,
**request.resolver_match.kwargs
)
def process_request(self, request):
self.key_prefix = self.get_key_prefix(request)
return super().process_request(request)
def process_response(self, request, response):
self.key_prefix = self.get_key_prefix(request)
return super().process_response(request, response)
Then in your code:
from django.utils.lru_cache import lru_cache
#lru_cache()
def last_modified(request, blog_id):
"""return fresh key_prefix"""
#extended_cache_page(60 * 15, key_prefix=last_modified)
def view_blog(request, blog_id):
"""view blog page with comments"""
Most of the solutions above didn't work in our case because we use https. The source code for get_cache_key reveals that it uses request.get_absolute_uri() to generate the cache key.
The default HttpRequest class sets the scheme as http. Thus we need to override it to use https for our dummy request object.
This is the code that works fine for us :)
from django.core.cache import cache
from django.http import HttpRequest
from django.utils.cache import get_cache_key
class HttpsRequest(HttpRequest):
#property
def scheme(self):
return "https"
def invalidate_cache_page(
path,
query_params=None,
method="GET",
):
request = HttpsRequest()
# meta information can be checked from error logs
request.META = {
"SERVER_NAME": "www.yourwebsite.com",
"SERVER_PORT": "443",
"QUERY_STRING": query_params,
}
request.path = path
key = get_cache_key(request, method=method)
if cache.has_key(key):
cache.delete(key)
Now I can use this utility function to invalidate the cache from any of our views:
page = reverse('url_name', kwargs={'id': obj.id})
invalidate_cache_page(path)
Duncan's answer works well with Django 1.9. But if we need invalidate url with GET-parameter we have to make a little changes in request.
Eg for .../?mykey=myvalue
request.META = {'SERVER_NAME':'127.0.0.1','SERVER_PORT':8000, 'REQUEST_METHOD':'GET', 'QUERY_STRING': 'mykey=myvalue'}
request.GET.__setitem__(key='mykey', value='myvalue')
I struggled with a similar situation and here is the solution I came up with, I started it on an earlier version of Django but it is currently in use on version 2.0.3.
First issue: when you set things to be cached in Django, it sets headers so that downstream caches -- including the browser cache -- cache your page.
To override that, you need to set middleware. I cribbed this from elsewhere on StackOverflow, but can't find it at the moment. In appname/middleware.py:
from django.utils.cache import add_never_cache_headers
class Disable(object):
def __init__(self, get_response):
self.get_response = get_response
def __call__(self, request):
response = self.get_response(request)
add_never_cache_headers(response)
return response
Then in settings.py, to MIDDLEWARE, add:
'appname.middleware.downstream_caching.Disable',
Keep in mind that this approach completely disables downstream caching, which may not be what you want.
Finally, I added to my views.py:
def expire_page(request, path=None, query_string=None, method='GET'):
"""
:param request: "real" request, or at least one providing the same scheme, host, and port as what you want to expire
:param path: The path you want to expire, if not the path on the request
:param query_string: The query string you want to expire, as opposed to the path on the request
:param method: the HTTP method for the page, if not GET
:return: None
"""
if query_string is not None:
request.META['QUERY_STRING'] = query_string
if path is not None:
request.path = path
request.method = method
# get_raw_uri and method show, as of this writing, everything used in the cache key
# print('req uri: {} method: {}'.format(request.get_raw_uri(), request.method))
key = get_cache_key(request)
if key in cache:
cache.delete(key)
I didn't like having to pass in a request object, but as of this writing, it provides the scheme/protocol, host, and port for the request, pretty much any request object for your site/app will do, as long as you pass in the path and query string.
One more updated version of Duncan's answer: had to figure out correct meta fields: (tested on Django 1.9.8)
def invalidate_cache(path=''):
import socket
from django.core.cache import cache
from django.http import HttpRequest
from django.utils.cache import get_cache_key
request = HttpRequest()
domain = 'www.yourdomain.com'
request.META = {'SERVER_NAME': socket.gethostname(), 'SERVER_PORT':8000, "HTTP_HOST": domain, 'HTTP_ACCEPT_ENCODING': 'gzip, deflate, br'}
request.LANGUAGE_CODE = 'en-us'
request.path = path
try:
cache_key = get_cache_key(request)
if cache_key :
if cache.has_key(cache_key):
cache.delete(cache_key)
return (True, 'successfully invalidated')
else:
return (False, 'cache_key does not exist in cache')
else:
raise ValueError('failed to create cache_key')
except (ValueError, Exception) as e:
return (False, e)
The solution is simple, and do not require any additional work.
Example
#cache_page(60 * 10)
def our_team(request, sorting=None):
...
This will set the response to the cache with the default key.
Expire a view cache
from django.utils.cache import get_cache_key
from django.core.cache import cache
def our_team(request, sorting=None):
# This will remove the cache value and set it to None
cache.set(get_cache_key(request), None)
Simple, Clean, Fast.

Add request header before redirection

I am retrofitting a Django web app for a new client. To this end I have added a url pattern that redirects requests from new client to old url patterns.
from:-
(('api/(?P<phone>\w+)/MessageA', handle_a_message),
('api/(?P<phone>\w+)/MessageB', handle_b_message),
...)
to:-
(('api/(?P<phone>\w+)/MessageA', handle_a_message),
('api/(?P<phone>\w+)/MessageB', handle_b_message),
('api/newclient', handle_newclient)
...)
views.handle_newclient
def handle_newclient(request):
return redirect('/api/%(phone)s/%(msg)s' % request.GET)
This somewhat works. However the new client doesn't do basic auth which those url's need. Also the default output is json where the new client needs plain text. Is there a way I can tweak the headers before redirecting to the existing url's?
Django FBV's should return an HTTPResponse object (or subclass thereof). The Django shorcut redirect returns HttpResponseRedirect which is a subclass of HTTPResponse. This means we can set the headers for redirect() the way we will set headers for a typical HTTPResponse object. We can do that like so:
def my_view(request):
response = redirect('http://www.gamefaqs.com')
# Set 'Test' header and then delete
response['Test'] = 'Test'
del response['Test']
# Set 'Test Header' header
response['Test Header'] = 'Test Header'
return response
Relevant docs here and here.

Django: simulate HTTP requests in shell

I just learnt that with Rails is possible to simulate HTTP requests in the console with few lines of code.
Check out: http://37signals.com/svn/posts/3176-three-quick-rails-console-tips (section "Dive into your app").
Is there a similar way to do that with Django? Would be handy.
You can use RequestFactory, which allows
inserting a user into the request
inserting an uploaded file into the request
sending specific parameters to the view
and does not require the additional dependency of using requests.
Note that you have to specify both the URL and the view class, so it takes an extra line of code than using requests.
from django.test import RequestFactory
request_factory = RequestFactory()
my_url = '/my_full/url/here' # Replace with your URL -- or use reverse
my_request = request_factory.get(my_url)
response = MyClasBasedView.as_view()(my_request) # Replace with your view
response.render()
print(response)
To set the user of the request, do something like my_request.user = User.objects.get(id=123) before getting the response.
To send parameters to a class-based view, do something like response = MyClasBasedView.as_view()(my_request, parameter_1, parameter_2)
Extended Example
Here's an example of using RequestFactory with these things in combination
HTTP POST (to url url, functional view view, and a data dictionary post_data)
uploading a single file (path file_path, name file_name, and form field value file_key)
assigning a user to the request (user)
passing on kwargs dictionary from the url (url_kwargs)
SimpleUploadedFile helps format the file in a way that is valid for forms.
from django.core.files.uploadedfile import SimpleUploadedFile
from django.test import RequestFactory
request = RequestFactory().post(url, post_data)
with open(file_path, 'rb') as file_ptr:
request.FILES[file_key] = SimpleUploadedFile(file_name, file_ptr.read())
file_ptr.seek(0) # resets the file pointer after the read
if user:
request.user = user
response = view(request, **url_kwargs)
Using RequestFactory from a Python shell
RequestFactory names your server "testserver" by default, which can cause a problem if you're not using it inside test code. You'll see an error like:
DisallowedHost: Invalid HTTP_HOST header: 'testserver'. You may need to add 'testserver' to ALLOWED_HOSTS.
This workaround from #boatcoder's comment shows how to override the default server name to "localhost":
request_factory = RequestFactory(**{"SERVER_NAME": "localhost", "wsgi.url_scheme":"https"}).
How I simulate requests from the python command line is:
Use the excellent requests library
Use the django reverse function
A simple way of simulating requests is:
>>> from django.urls import reverse
>>> import requests
>>> r = requests.get(reverse('app.views.your_view'))
>>> r.text
(prints output)
>>> r.status_code
200
Update: be sure to launch the django shell (via manage.py shell), not a classic python shell.
Update 2: For Django <1.10, change the first line to
from django.core.urlresolvers import reverse
(See tldr; down)
Its an old question,
but just adding an answer, in case someone maybe interested.
Though this might not be the best(or lets say Django) way of doing things.
but you can try doing this way.
Inside your django shell
>>> import requests
>>> r = requests.get('your_full_url_here')
Explanation:
I omitted the reverse(),
explanation being, since reverse() more or less,
finds the url associated to a views.py function,
you may omit the reverse() if you wish to, and put the whole url instead.
For example, if you have a friends app in your django project,
and you want to see the list_all() (in views.py) function in the friends app,
then you may do this.
TLDR;
>>> import requests
>>> url = 'http://localhost:8000/friends/list_all'
>>> r = requests.get(url)

How to find the location URL in a Django response object?

Let's say I have a Django response object.
I want to find the URL (location).
However, the response header does not actually contain a Location or Content-Location field.
How do I determine, from this response object, the URL it is showing?
This is old, but I ran into a similar issue when doing unit tests. Here is how I solved the problem.
You can use the response.redirect_chain and/or the response.request['PATH_INFO'] to grab redirect urls.
Check out the documentation as well! Django Testing Tools: assertRedirects
from django.core.urlresolvers import reverse
from django.test import TestCase
class MyTest(TestCase)
def test_foo(self):
foo_path = reverse('foo')
bar_path = reverse('bar')
data = {'bar': 'baz'}
response = self.client.post(foo_path, data, follow=True)
# Get last redirect
self.assertGreater(len(response.redirect_chain), 0)
# last_url will be something like 'http://testserver/.../'
last_url, status_code = response.redirect_chain[-1]
self.assertIn(bar_path, last_url)
self.assertEqual(status_code, 302)
# Get the exact final path from the response,
# excluding server and get params.
last_path = response.request['PATH_INFO']
self.assertEqual(bar_path, last_path)
# Note that you can also assert for redirects directly.
self.assertRedirects(response, bar_path)
The response does not decide the url, the request does.
The response gives you the content of the response, not the url of it.

Django URL Detail

I have to assign to work on one Django project. I need to know about the URL say, http://....
Since with ‘urls.py’ we indeed have ‘raw’ information. How I come to know about the complete URL name; mean with
http+domain+parameters
Amit.
Look at this snippet :
http://djangosnippets.org/snippets/1197/
I modified it like this :
from django.contrib.sites.models import RequestSite
from django.contrib.sites.models import Site
def site_info(request):
site_info = {'protocol': request.is_secure() and 'https' or 'http'}
if Site._meta.installed:
site_info['domain'] = Site.objects.get_current().domain
site_info['name'] = Site.objects.get_current().name
else:
site_info['domain'] = RequestSite(request).domain
site_info['name'] = RequestSite(request).name
site_info['root'] = site_info['protocol'] + '://' + site_info['domain']
return {'site_info':site_info}
The if/else is because of different versions of Django Site API
This snippet is actually a context processor, so you have to paste it in a file called context_processors.py in your application, then add to your settings :
TEMPLATE_CONTEXT_PROCESSORS = DEFAULT_SETTINGS.TEMPLATE_CONTEXT_PROCESSORS + (
'name-of-your-app.context_processors.site_info',
)
The + is here to take care that we d'ont override the possible default context processor set up by django, now or in the future, we just add this one to the tuple.
Finally, make sure that you use RequestContext in your views when returning the response, and not just Context. This explained here in the docs.
It's just a matter of using :
def some_view(request):
# ...
return render_to_response('my_template.html',
my_data_dictionary,
context_instance=RequestContext(request))
HTTPS status would be handled differently by different web servers.
For my Nginx reverse proxy to Apache+WSGI setup, I explicitly set a header that apache (django) can check to see if the connection is secure.
This info would not be available in the URL but in your view request object.
django uses request.is_secure() to determine if the connection is secure. How it does so depends on the backend.
http://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpRequest.is_secure
For example, for mod_python, it's the following code:
def is_secure(self):
try:
return self._req.is_https()
except AttributeError:
# mod_python < 3.2.10 doesn't have req.is_https().
return self._req.subprocess_env.get('HTTPS', '').lower() in ('on', '1')
If you are using a proxy, you will probably find it useful that HTTP Headers are available in HttpRequest.META
http://docs.djangoproject.com/en/dev/ref/request-response/#django.http.HttpRequest.META
Update: if you want to log every secure request, use the above example with a middleware
class LogHttpsMiddleware(object):
def process_request(self, request):
if request.is_secure():
protocol = 'https'
else:
protocol = 'http'
print "%s://www.mydomain.com%s" % (protocol, request.path)
Add LogHttpsMiddleware to your settings.py MIDDLEWARE_CLASSES