How to recache a page from a management command? - django

I've got a problem with a sitemaps.xml that takes so long to generate that the search engines time out. There is no memcached installed, so I've quickly added a FileBasedCache, which happily solves the immediate problem, except for the first request of each cache lifetime.
The sitemap changes once a week and that event is invoked by a cron job which calls a management command which loads new data. So the immediate idea is to extend the cache life to a week and force a flush and reload of the cache whenever the cronjob/management command runs.
But how might one do that?

As you don't seem to be caching anywhere else on the site for the time being, the following should clear the entire cache:
import sys
import urllib2

from django.conf import settings
from django.core.cache import cache
from django.core.urlresolvers import reverse

sys.stdout.write('Rebuilding sitemap\n')

# Flush the whole cache, then request the sitemap so the fresh page
# is cached again immediately.
cache.clear()
sitemap = urllib2.urlopen('http://' + settings.HOST_DOMAIN + reverse('sitemap'))
sitemap.read()
I would then use urllib2 to send a request to yoursite/sitemaps.xml which should recache the new page.
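If you want this wired straight into the weekly cron job, the snippet drops naturally into a custom management command. A minimal sketch (the recache_sitemap command name and file path are assumptions; HOST_DOMAIN and the 'sitemap' URL name come from the snippet above):
# myapp/management/commands/recache_sitemap.py
import urllib2

from django.conf import settings
from django.core.cache import cache
from django.core.management.base import BaseCommand
from django.core.urlresolvers import reverse

class Command(BaseCommand):
    help = 'Flush the cache and re-request the sitemap so it is recached.'

    def handle(self, *args, **options):
        self.stdout.write('Rebuilding sitemap\n')
        cache.clear()
        # Fetching the page repopulates the file-based cache entry.
        sitemap = urllib2.urlopen(
            'http://' + settings.HOST_DOMAIN + reverse('sitemap'))
        sitemap.read()
The cron job can then call python manage.py recache_sitemap right after it loads the new data.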

Related

Load data in memory during server startup in Django app

I am creating a Django app in which I need to load and keep some data in memory when the server starts, for quick access. To achieve this I am using Django's AppConfig class. The code looks something like this:
from django.apps import AppConfig
from django.core.cache import cache

class myAppConfig(AppConfig):
    name = 'myapp'

    def ready(self):
        # Models can only be imported once the app registry is loaded.
        from myapp.models import myModel
        data = myModel.objects.values('A', 'B', 'C')
        cache.set('mykey', data)
The problem is that this data in the cache will expire after some time. One alternative is to increase the TIMEOUT value, but I want this to be available in memory all the time. Is there some other configuration or approach I can use to achieve this?
If you are using Redis, it does provide persistence; check it out at https://redis.io/topics/persistence. You have to set it up through the Redis configuration.
There are two methods of keeping the data persistent across a restart; both save the data to disk, in different formats. I would suggest setting up AOF for your purpose, since it records every write operation. However, it will take more disk space.
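For reference, a minimal redis.conf sketch enabling AOF (the everysec fsync policy is a common choice, not something mandated above):
# redis.conf
appendonly yes          # turn on the append-only file (AOF)
appendfsync everysec    # fsync the AOF once per second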

Removing items from celery_beat doesn't remove them from database schedule

I'm using django-celery-beat in a Django app (this stores the schedule in the database instead of a local file). I've configured my schedule in the celery_beat dictionary that Celery is initialized with via app.config_from_object(...).
I recently renamed/removed a few tasks and restarted the app. The new tasks showed up, but the tasks removed from the celery_beat dictionary didn't get removed from the database.
Is this expected workflow -- requiring manual removal of tasks from the database? Is there a workaround to automatically reconcile the schedule at Django startup?
I tried a PeriodicTask cleanup in celery/__init__.py:
def _clean_schedule():
    from django.conf import settings
    from django.db import transaction
    from django_celery_beat.models import PeriodicTask
    from django_celery_beat.models import PeriodicTasks

    with transaction.atomic():
        PeriodicTask.objects.\
            exclude(task__startswith='celery.').\
            exclude(name__in=settings.CELERY_CONFIG.celery_beat.keys()).\
            delete()
        PeriodicTasks.update_changed()

_clean_schedule()
but that is not allowed because Django isn't properly started up yet:
django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.
You also can't use Django's AppConfig.ready() because making queries / db connections in ready() is not supported.
Looking at how django-celery-beat actually installs the schedules, I thought maybe I could hook into that process.
It doesn't happen when Django starts -- it happens when beat starts. It calls setup_schedule() against the class passed on the beat command line.
Therefore, we can just override the scheduler with
--scheduler=myproject.lib.scheduler:DatabaseSchedulerWithCleanup
to do cleanup:
import logging

from django.db import transaction
from django_celery_beat.models import PeriodicTask
from django_celery_beat.models import PeriodicTasks
from django_celery_beat.schedulers import DatabaseScheduler

class DatabaseSchedulerWithCleanup(DatabaseScheduler):

    def setup_schedule(self):
        schedule = self.app.conf.beat_schedule
        with transaction.atomic():
            # Delete periodic tasks no longer present in beat_schedule,
            # leaving Celery's internal 'celery.' tasks alone.
            num, info = PeriodicTask.objects.\
                exclude(task__startswith='celery.').\
                exclude(name__in=schedule.keys()).\
                delete()
            logging.info("Removed %d obsolete periodic tasks.", num)
            if num > 0:
                PeriodicTasks.update_changed()
        super(DatabaseSchedulerWithCleanup, self).setup_schedule()
Note, you only want this if you are exclusively managing tasks with beat_schedule. If you add tasks via the Django admin or programmatically, they will also be deleted.
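For completeness, the full beat invocation with the custom scheduler would look something like this (assuming your Celery app module is myproject):
celery -A myproject beat --scheduler=myproject.lib.scheduler:DatabaseSchedulerWithCleanup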

How do I turn off django cumulus in my local_settings.py

I have taken over a project that uses django-cumulus for cloud storage. On my development machine I sometimes use a slow internet connection, and every time I save a change, Django recompiles and tries to make a connection to the Rackspace store:
Starting new HTTPS connection (1): identity.api.rackspacecloud.com
This sometimes takes 15 seconds and is a real pain. I read a post where someone said they turned off cumulus for local development. I think this was done by setting
DEFAULT_FILE_STORAGE
but unfortunately the poster did not specify the value. If someone knows a simple setting I can put in my local settings to serve media and static files from my local machine, and stop Django trying to connect to my cloud storage on every save, that is what I want to do.
Yeah, it looks like you should just need DEFAULT_FILE_STORAGE set to its default value, which is django.core.files.storage.FileSystemStorage according to the source code.
However, a better approach would be to not set anything in your local settings, and instead set DEFAULT_FILE_STORAGE and CUMULUS in a staging_settings.py or prod_settings.py file.
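If you do want to pin it in local settings, a minimal local_settings.py sketch (the media paths are illustrative placeholders) could be:
# local_settings.py -- serve files from the local disk in development
DEFAULT_FILE_STORAGE = 'django.core.files.storage.FileSystemStorage'
MEDIA_ROOT = '/path/to/local/media'  # hypothetical local directory
MEDIA_URL = '/media/'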
The constant reconnecting to the Rackspace store happened because the previous developer had, in common/storage.py:
from cumulus.storage import SwiftclientStorage

class PrivateStorage(SwiftclientStorage):
    ...
and in models.py:
from common.storage import PrivateStorage

PRIVATE_STORE = PrivateStorage()

...

class Upload(models.Model):
    upload = models.FileField(storage=PRIVATE_STORE, upload_to=get_upload_path)
This meant that every time the project reloaded, it would create a new HTTPS connection to Rackspace and time out if the connection was poor. I created a settings flag to control this, wrapping the import of SwiftclientStorage and the definition of PrivateStorage like so:
from django.conf import settings

if settings.USECUMULUS:
    from cumulus.storage import SwiftclientStorage

    class PrivateStorage(SwiftclientStorage):
        ...
else:
    class PrivateStorage():
        pass
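With that in place, local_settings.py only needs the flag switched off (the production settings would set it to True):
# local_settings.py
USECUMULUS = False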

Is it dangerous to use dynamic database TEST_NAME in Django?

We are a small team of developers working on an application that uses the PostgreSQL database backend. Each of us has a separate working directory and virtualenv, but we share the same PostgreSQL database server; even Jenkins is on the same machine.
So I am trying to figure out a way to let us run tests on the same project in parallel without running into database name collisions. Furthermore, a Jenkins build sometimes fails mid-way and the test database doesn't get dropped at the end, so a subsequent Jenkins build can get confused by the existing database and fail automatically.
What I've decided to try is this:
import os
from datetime import datetime

DATABASES = {
    'default': {
        # the usual lines ...
        'TEST_NAME': '{user}_{epoch_ts}_awesome_app'.format(
            user=os.environ.get('USER', os.environ['LOGNAME']),
            # This gives the number of seconds since the UNIX epoch
            epoch_ts=int((datetime.utcnow() - datetime.utcfromtimestamp(0)).total_seconds())
        ),
        # etc
    }
}
So the test database name at each test run most probably will be unique, using the username and the timestamp. This way Jenkins can even run builds in parallel, I think.
It seems to work so far. But could it be dangerous? I'm guessing we're safe as long as we don't try to import the project settings module directly and only use django.conf.settings because that should be singleton-like and evaluated only once, right?
I'm doing something similar and have not run into any issues. The usual precautions should be followed:
Don't import the project settings module directly; always go through django.conf.settings.
Don't cause the values in settings to be evaluated in a module's top level. See the doc for details. For instance, don't do this:
from django.conf import settings

# This is bad because settings might still be in the process of being
# configured at this stage.
blah = settings.BLAH

def some_view():
    # This is okay because by the time views are called by Django,
    # the settings are supposed to be all configured.
    blah = settings.BLAH
Don't do anything that accesses the database in a module's top level. The doc warns:
If your code attempts to access the database when its modules are compiled, this will occur before the test database is set up, with potentially unexpected results. For example, if you have a database query in module-level code and a real database exists, production data could pollute your tests. It is a bad idea to have such import-time database queries in your code anyway - rewrite your code so that it doesn’t do this.
Instead of the time, you could use the Jenkins executor number (available in the environment); that would be unique enough and you wouldn't have to worry about it changing.
As a bonus, you could then use --keepdb to avoid rebuilding the database from scratch each time... On the downside, failed and corrupted databases would have to be dropped separately (perhaps the settings.py can print out the database name that it's using, to facilitate manual dropping).
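A sketch of that executor-number variant (EXECUTOR_NUMBER is the environment variable Jenkins sets for each executor; the '0' fallback for local runs is an assumption):
import os

DATABASES = {
    'default': {
        # the usual lines ...
        'TEST_NAME': '{user}_executor{n}_awesome_app'.format(
            user=os.environ.get('USER', os.environ['LOGNAME']),
            # Jenkins sets EXECUTOR_NUMBER; it is stable across runs.
            n=os.environ.get('EXECUTOR_NUMBER', '0'),
        ),
        # etc
    }
}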

Run Scrapy script, process output, and load into database all at once?

I've managed to write a Scrapy project that scrapes data from a web page, and when I run it with scrapy crawl dmoz -o items.json -t json at the command line, it successfully outputs the scraped data to a JSON file.
I then wrote another script that takes that JSON file, loads it, changes the way the data is organized (I didn't like the default way it was being organized), and spits it out as a second JSON file. I then use Django's manage.py loaddata fixture.json command to load the contents of that second file into a Django database.
Now, I'm sensing that I'm going to get laughed out of the building for doing this in three separate steps, but I'm not quite sure how to put it all together into one script. For starters, it does seem really stupid that I can't just have my Scrapy project output the data in exactly the way I want. But where do I put the code to modify the 'default' way the feed exports output my data? Would it just go in my pipelines.py file?
And secondly, I want to call the scraper from inside a python script that will then also load the resulting JSON fixture into my database. Is that as simple as putting something like:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
at the top of my script, and then following it with something like:
from django.something.manage import loaddata
loaddata('/path/to/fixture.json')
? And finally, is there any specific place this script would have to live relative to both my Django project and the Scrapy project for it to work properly?
Exactly that. Define a custom item pipeline in pipelines.py to output the item data as desired, then add the pipeline class to settings.py. The Scrapy documentation has a JsonWriterPipeline example that may be of use.
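As a rough sketch modeled on that documentation example, a pipeline could write the items straight out in Django fixture format, skipping the intermediate reorganisation script entirely (the myapp.mymodel label, pk scheme, and field mapping are placeholders for your actual model):
# pipelines.py
import json

class DjangoFixturePipeline(object):
    """Collect scraped items and write them out as a Django fixture."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # 'myapp.mymodel' is a hypothetical model label; map the item
        # fields however you want the data organised.
        self.items.append({
            'model': 'myapp.mymodel',
            'pk': len(self.items) + 1,  # illustrative; pick real keys
            'fields': dict(item),
        })
        return item

    def close_spider(self, spider):
        # loaddata expects a JSON array of model/pk/fields objects.
        with open('fixture.json', 'w') as f:
            json.dump(self.items, f)
Enable it by adding the class to ITEM_PIPELINES in settings.py.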
Well, on the basis that the script in your example was taken from the Scrapy documentation, it should work. Have you tried it?
The location shouldn't matter, so long as all of the necessary imports work. You could test this by firing a Python interpreter in the desired location and then checking all of the imports one by one. If they all run correctly, then the script should be fine.
If anything goes wrong, then post another question in here with the relevant code and what was tried and I'm sure someone will be happy to help out. :)
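On the loaddata half of the question: there is no importable django.something.manage module, but Django's call_command does exactly that. A minimal sketch, assuming DJANGO_SETTINGS_MODULE is already set for the script:
from django.core.management import call_command

call_command('loaddata', '/path/to/fixture.json')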