Django Models - Auto-Refresh field value - django

In my Django models.py I crawl an item's price from Amazon using lxml.
When I hit save in the admin page, it stores this price in a "price" field, but sometimes Amazon prices change, so I would like to update the price automatically every 2 days. This is my code for now:
import datetime

import requests
from lxml import html
from django.db import models

class AmazonItem(models.Model):
    amazon_url = models.CharField(max_length=800, null=True, blank=True)
    price = models.DecimalField(max_digits=6, decimal_places=2, editable=False)
    last_update = models.DateTimeField(editable=False)

    def save(self):
        if not self.id:
            if self.amazon_url:
                url = self.amazon_url
                source_code = requests.get(url)
                code = html.fromstring(source_code.text)
                prices = code.xpath('//span[@id="priceblock_ourprice"]/text()')
                eur = prices[0].replace("EUR ", "")
                nospace = eur.replace(" ", "")
                nodown = nospace.replace("\n", "")
                end = nodown.replace(",", ".")
                self.price = float(end)
            else:
                self.price = 0
        self.last_update = datetime.datetime.today()
        super(AmazonItem, self).save()
I really have no idea how to do this; I just would like it to happen automatically.

Isolate the sync functionality
First I'd isolate the sync functionality out of save(), e.g. you can create an AmazonItem.sync() method:
def sync(self):
    # Your HTTP request and HTML parsing here
    # Update self.price, self.last_update etc.
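For instance, sync() could be little more than the scraping code from the question moved out of save() (a minimal sketch, reusing the requests/lxml logic shown above):

def sync(self):
    # Minimal sketch: the same scraping logic as in the question's save(),
    # just moved into its own method so it can be called from anywhere.
    if not self.amazon_url:
        self.price = 0
    else:
        source_code = requests.get(self.amazon_url)
        code = html.fromstring(source_code.text)
        prices = code.xpath('//span[@id="priceblock_ourprice"]/text()')
        raw = prices[0].replace("EUR ", "").replace(" ", "").replace("\n", "").replace(",", ".")
        self.price = float(raw)
    self.last_update = datetime.datetime.today()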
Cron job with management command
So now, your starting point will be to call .sync() on the model instances that you want to sync. A very crude* way is to:
for amazon_item in AmazonItem.objects.all():
    amazon_item.sync()
    amazon_item.save()
You can e.g. put that code inside a Django management command called sync_amazon_items, and set up a cron job to run it every 2 days:
# app/management/commands/sync_amazon_items.py
from django.core.management.base import BaseCommand

from app.models import AmazonItem

class Command(BaseCommand):
    def handle(self, *args, **options):
        for amazon_item in AmazonItem.objects.all():
            amazon_item.sync()
            amazon_item.save()
Then you can make your OS or job scheduler run it, e.g. using python manage.py sync_amazon_items.
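For instance, a crontab entry along these lines would run the command every 2nd day at 03:00 (a sketch only; the interpreter and project paths are placeholders you'd adapt to your setup):

# m h dom mon dow  command
0 3 */2 * * /path/to/virtualenv/bin/python /path/to/project/manage.py sync_amazon_items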
* This will be very slow, as it goes sequentially through your list, and an error on any item will stop the whole operation, so you'll want to catch exceptions, log them, and keep going, e.g. as in the sketch below.
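A minimal sketch of that error handling (logger here is just a standard logging.getLogger(__name__) logger, not something from the original code):

import logging

logger = logging.getLogger(__name__)

for amazon_item in AmazonItem.objects.all():
    try:
        amazon_item.sync()
        amazon_item.save()
    except Exception:
        # Log the failure and keep going with the next item
        logger.exception("Failed to sync AmazonItem %s", amazon_item.pk)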
Use a task queue / scheduler
A more performant and reliable (failures stay isolated) way is to queue up sync jobs (e.g. a job for each amazon_item, or a batch of N amazon_items) into a job queue like Celery, then configure Celery's concurrency to run a few sync jobs concurrently.
To schedule periodic tasks with Celery, have a look at Periodic Tasks.
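As a rough sketch (assuming a working Celery setup; app/tasks.py and the schedule entry below are illustrative, not part of the original code):

# app/tasks.py -- hypothetical module; assumes Celery is already configured
from celery import shared_task

from app.models import AmazonItem

@shared_task
def sync_amazon_item(item_pk):
    item = AmazonItem.objects.get(pk=item_pk)
    item.sync()
    item.save()

@shared_task
def queue_amazon_item_syncs():
    # Fan out one small job per item so a failure only affects that item
    for pk in AmazonItem.objects.values_list('pk', flat=True):
        sync_amazon_item.delay(pk)

# settings.py -- run the fan-out task every 2 days with Celery beat
# (CELERYBEAT_SCHEDULE is the pre-Celery-4 setting name; newer versions use app.conf.beat_schedule)
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'sync-amazon-items-every-2-days': {
        'task': 'app.tasks.queue_amazon_item_syncs',
        'schedule': timedelta(days=2),
    },
}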


Date on which maximum number of tasks were completed in a single day django

I have 2 models like this. One user can have multiple tasks:
class User(models.Model):
    userCreated_at = models.DateTimeField(blank=True)

class Tasks(models.Model):
    owner = models.ForeignKey(User, on_delete=models.CASCADE)
    title = models.CharField(max_length=100, null=True, blank=True)
    completion_date_time = models.DateTimeField(blank=True)
I want to calculate: since the user's account was created, on what date was the maximum number of tasks completed in a single day? Any help would be highly appreciated.
Since you can only complete a task after the user is created, you should not worry about the "account creation date".
Instead, Tasks.objects.filter(owner=request.user) should give you a queryset containing all of that user's tasks.
Finding the date on which the most tasks were completed by a user will be hard with this data model, but maintaining a table that tracks it is easy. Here's how you could do it:
class user_max_date(models.Model):
    # One record per user, so it can be accessed as request.user.user_max_date below
    user = models.OneToOneField(User, on_delete=models.CASCADE)  # need to import the user model
    date = models.DateTimeField(blank=True)
Now, every time a new task is completed, you have to check whether the tasks completed today outnumber the tasks completed on the current max date; if so, update this user_max_date record.
So here's how that would work:
from datetime import date

today = date.today()
tasks_done_today = Tasks.objects.filter(owner=request.user).filter(
    completion_date_time__year=today.year,
    completion_date_time__month=today.month,
    completion_date_time__day=today.day)

max_date = request.user.user_max_date.date.date()
max_tasks_done = Tasks.objects.filter(owner=request.user).filter(
    completion_date_time__year=max_date.year,
    completion_date_time__month=max_date.month,
    completion_date_time__day=max_date.day)

if len(max_tasks_done) < len(tasks_done_today):
    temp = request.user.user_max_date
    temp.date = today
    temp.save()
You will need to add this logic to your views.py in the view used to complete the tasks, or you can add a signal that executes this logic.
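If you prefer the signal route, here is a minimal sketch (it assumes Django 1.9+ for the __date lookup, and that saving a Tasks row with a completion date is what should trigger the check; the names mirror the models above):

from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Tasks)
def update_user_max_date(sender, instance, **kwargs):
    # Only react once the task actually has a completion date
    if instance.completion_date_time is None:
        return
    owner = instance.owner
    completed = instance.completion_date_time.date()
    done_that_day = Tasks.objects.filter(
        owner=owner,
        completion_date_time__date=completed).count()
    record, _ = user_max_date.objects.get_or_create(
        user=owner, defaults={'date': instance.completion_date_time})
    done_on_max = Tasks.objects.filter(
        owner=owner,
        completion_date_time__date=record.date.date()).count()
    if done_that_day > done_on_max:
        record.date = instance.completion_date_time
        record.save()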

Tweet Scheduler based on Flask Megatutorial: Issues with getting fields of flask-sqlalchemy database

I'm in the process of modifying the Flask app created in following along Miguel Grinberg's Flask Mega Tutorial such that it is possible to post tweets. I have imported tweepy for accessing the twitter api and modified the databases to hold the scheduled time of a tweet.
I wish to iterate over the current_user's posts and the corresponding times from the SQLAlchemy database and post when the current time matches the scheduled time.
The database model modifications in model.py are as follows:
class Post(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    body = db.Column(db.String(140))
    timestamp = db.Column(db.DateTime, index=True, default=datetime.utcnow)
    socialnetwork = db.Column(db.String(40))
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'))
    # This is the stuff for scheduling, just date
    hour = db.Column(db.Integer)
    minute = db.Column(db.Integer)
    day = db.Column(db.Integer)
    month = db.Column(db.Integer)
    year = db.Column(db.Integer)
    ampm = db.Column(db.String(2))
Just as a test, I wanted to iterate over the current user's posts and tweet them using tweepy:
@app.before_first_request
def activate_job():
    def run_job():
        posts = current_user.followed_posts().filter_by(socialnetwork='Twitter')
        for post in posts:
            tweepy_api.update_status(message)
            time.sleep(30)
    thread = threading.Thread(target=run_job)
    thread.start()
However, this returned the error:
AttributeError: 'NoneType' object has no attribute 'followed_posts'
on the terminal. This is perplexing me as I have used current_user multiple times in the same file to filter the posts by social network.
As in the following case in routes.py
@app.route('/<username>')
@login_required
def user(username):
    user = User.query.filter_by(username=username).first_or_404()
    socialnetwork = request.args.get("socialnetwork")
    if socialnetwork == 'Facebook':
        posts = current_user.followed_posts().filter_by(socialnetwork='Facebook')
    elif socialnetwork == 'Twitter':
        posts = current_user.followed_posts().filter_by(socialnetwork='Twitter')
    else:
        posts = current_user.followed_posts()
    return render_template('user.html', user=user, posts=posts, form=socialnetwork)
The above yields no error and works perfectly.
If anyone could shed some light on what I am doing wrong, I'd be truly grateful.
You're likely running into issues because you're trying to get current_user on a different thread (see the Flask docs for more details). You're calling run_job() in a different context that doesn't have any current user (because there's no active request).
I'd rework it so that you get the current user's posts on the main thread (i.e. in activate_job()), then pass the list of posts to the background thread to work on.
Something like:
def activate_job():
    # Evaluate the query on the main thread, inside the request context
    posts = list(current_user.followed_posts().filter_by(socialnetwork='Twitter'))

    def run_job(posts):
        for post in posts:
            tweepy_api.update_status(post.body)  # assuming the tweet text lives in post.body
            time.sleep(30)

    thread = threading.Thread(target=run_job, args=[posts])
    thread.start()
It's also worth noting that you may want to rethink your overall approach. Rather than checking on each request whether there are any scheduled tweets to send, you should use some sort of background task queue that can operate independently of the web process. That way, you're not checking redundantly on each request, and you're not dependent on users making requests around the scheduled time.
See The Flask Mega-Tutorial Part XXII: Background Jobs for more details, and look into Celery.
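As a very rough sketch of that idea (it assumes you consolidate the schedule into a single scheduled_at DateTime column and add a sent flag, neither of which exists in the model above, and that some scheduler such as cron, RQ or Celery invokes the function):

from datetime import datetime

def send_due_tweets():
    # Runs outside any request context, so it queries Post directly
    # instead of going through current_user.
    now = datetime.utcnow()
    due_posts = Post.query.filter(Post.socialnetwork == 'Twitter',
                                  Post.scheduled_at <= now,   # assumed column
                                  Post.sent == False).all()   # assumed flag
    for post in due_posts:
        tweepy_api.update_status(post.body)
        post.sent = True
    db.session.commit()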

Django: object creation in atomic transaction

I have a simple Task model:
class Task(models.Model):
    name = models.CharField(max_length=255)
    order = models.IntegerField(db_index=True)
And a simple task_create view:
def task_create(request):
    name = request.POST.get('name')
    order = request.POST.get('order')
    Task.objects.filter(order__gte=order).update(order=F('order') + 1)
    new_task = Task.objects.create(name=name, order=order)
    return HttpResponse(new_task.id)
The view shifts the existing tasks that come after the newly created one by +1, then creates the new task.
There are lots of users of this view, and I suppose something will go wrong with the ordering one day, because the update and the create should definitely be performed together.
So I just want to be sure: will this be enough to avoid any data corruption?
from django.db import transaction

def task_create(request):
    name = request.POST.get('name')
    order = request.POST.get('order')
    with transaction.atomic():
        Task.objects.select_for_update().filter(order__gte=order).update(order=F('order') + 1)
        new_task = Task.objects.create(name=name, order=order)
    return HttpResponse(new_task.id)
1) Should something more be done on the task creation line, like adding select_for_update before the filter on the existing Task.objects?
2) Does it matter where return HttpResponse() is located? Inside transaction block or outside?
Big thx
1) Should something more be done on the task creation line, like adding select_for_update before the filter on the existing Task.objects?
No - what you have currently looks fine and should work the way you want it to.
2) Does it matter where return HttpResponse() is located? Inside transaction block or outside?
Yes, it does matter. You need to return a response to the client regardless of whether your transaction was successful or not - so it definitely needs to be outside of the transaction block. If you did it inside the transaction, the client would get a 500 Server Error if the transaction failed.
However, if the transaction fails, then you will not have a new task ID and cannot return it in your response. So you probably need to return different responses depending on whether the transaction succeeded, e.g.:
from django.db import IntegrityError, transaction

try:
    with transaction.atomic():
        Task.objects.select_for_update().filter(order__gte=order).update(
            order=F('order') + 1)
        new_task = Task.objects.create(name=name, order=order)
except IntegrityError:
    # Transaction failed - return a response notifying the client
    return HttpResponse('Failed to create task, please try again!')

# If it succeeded, then return a normal response
return HttpResponse(new_task.id)
You could also try to change your model so you don't need to update so many other rows when inserting a new one.
For example, you could try something resembling a double-linked list.
(I used long explicit names for fields and variables here).
# models.py
class Task(models.Model):
    name = models.CharField(max_length=255)
    task_before_this_one = models.ForeignKey(
        'self',
        null=True,
        blank=True,
        related_name='task_before_this_one_set')
    task_after_this_one = models.ForeignKey(
        'self',
        null=True,
        blank=True,
        related_name='tasks_after_this_one_set')
Your task at the top of the queue would be the one that has the field task_before_this_one set to null. So to get the first task of the queue:
# these will throw exceptions if there are many instances
first_task = Task.objects.get(task_before_this_one=None)
last_task = Task.objects.get(task_after_this_one=None)
When inserting a new instance, you just need to know after which task it should be placed (or, alternatively, before which task). This code should do that:
def task_create(request):
    new_task = Task.objects.create(
        name=request.POST.get('name'))
    task_before = get_object_or_404(
        Task,
        pk=request.POST.get('task_before_the_new_one'))
    task_after = task_before.task_after_this_one

    # modify the 2 other tasks
    task_before.task_after_this_one = new_task
    task_before.save()

    if task_after is not None:
        # 'task_after' will be None if 'task_before' is the last one in the queue
        task_after.task_before_this_one = new_task
        task_after.save()

    # update the newly created task
    new_task.task_before_this_one = task_before
    new_task.task_after_this_one = task_after  # this could be None
    new_task.save()

    return HttpResponse(new_task.pk)
This method only updates 2 other rows when inserting a new row. You might still want to wrap the whole method in a transaction if there is really high concurrency in your app, but this transaction will only lock up to 3 rows, not all the others as well.
This approach might be of use to you if you have a very long list of tasks.
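If you do wrap it, a minimal sketch could look like this (it reuses the view above and locks just the neighbouring rows with select_for_update; this is illustrative, not a drop-in implementation):

from django.db import transaction
from django.http import HttpResponse
from django.shortcuts import get_object_or_404

def task_create(request):
    with transaction.atomic():
        task_before = get_object_or_404(
            Task.objects.select_for_update(),
            pk=request.POST.get('task_before_the_new_one'))
        task_after = task_before.task_after_this_one
        if task_after is not None:
            # re-fetch the neighbour with a row lock
            task_after = Task.objects.select_for_update().get(pk=task_after.pk)

        new_task = Task.objects.create(
            name=request.POST.get('name'),
            task_before_this_one=task_before,
            task_after_this_one=task_after)

        task_before.task_after_this_one = new_task
        task_before.save()
        if task_after is not None:
            task_after.task_before_this_one = new_task
            task_after.save()

    return HttpResponse(new_task.pk)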
EDIT: how to get an ordered list of tasks
This can not be done at the database level in a single query (as far as I know), but you could try this function:
def get_ordered_task_list():
    # get the first task
    aux_task = Task.objects.get(task_before_this_one=None)
    task_list = []
    while aux_task is not None:
        task_list.append(aux_task)
        aux_task = aux_task.task_after_this_one
    return task_list
As long as you only have a few hundred tasks, this operation should not take so much time that it impacts the response time. But you will have to try that out for yourself, in your environment, with your database and your hardware.

Multiple Postgres SELECT processes (Django GET requests) stuck, causing 100% CPU usage

I'll try to give as much information as I can here. Although the solution would be great, I mainly want guidance on how to tackle the problem: how to view more useful log files, etc., as I'm new to server maintenance. Any advice is welcome.
Here's what's happening, in chronological order:
- I'm running 2 DigitalOcean droplets (Ubuntu 14.04 VPS)
- Droplet #1 is running Django, nginx and gunicorn
- Droplet #2 is running Postgres
- Everything ran fine for a month, then suddenly the Postgres droplet's CPU usage spiked to 100%
- You can see the htop output when this happens; I've attached a screenshot
- Another screenshot is the nginx error.log; you can see that the problem started at 15:56:14, which I highlighted with a red box
- sudo poweroff on the Postgres droplet and restarting it doesn't fix the problem
- Restoring the Postgres droplet to my last backup (20 hours ago) solves the problem, but it keeps happening again. This is the 7th time in 2 days
I'll continue to do research and give more information. Meanwhile any opinions are welcome.
Thank you.
Update 20 May 2016
- Enabled slow query logging on the Postgres server, as recommended by e4c5
- 6 hours later, the server froze (100% CPU usage) again at 8:07 AM. I've attached all related screenshots
- The browser displays a 502 error if I try to access the site during the freeze
- sudo service postgresql restart (and restarting gunicorn and nginx on the Django server) does NOT fix the freeze (I think this is a very interesting point)
- However, restoring the Postgres server to my previous backup (now 2 days old) does fix the freeze
- The culprit Postgres log message is: Could not send data to client: Broken Pipe
- The culprit nginx log message is a simple django-rest-framework API call which returns only 20 items (each with some foreign-key data queries)
Update#2 20 May 2016
When the freeze occurs, I tried doing the following in chronological order (turning everything off and then back on one by one):
- sudo service postgresql stop --> CPU usage falls to 0-10%
- sudo service gunicorn stop --> CPU usage stays at 0-10%
- sudo service nginx stop --> CPU usage stays at 0-10%
- sudo service postgresql restart --> CPU usage stays at 0-10%
- sudo service gunicorn restart --> CPU usage stays at 0-10%
- sudo service nginx restart --> CPU usage rises to 100% and stays there
So this is not about server load or long query time then?
This is very confusing, since if I restore the database to my latest backup (2 days old), everything comes back online even without touching the nginx/gunicorn/Django server...
Update 8 June 2016
I turned on slow query logging, set to log queries that take longer than 1000 ms.
This one query shows up in the log many times:
SELECT
"products_product"."id",
"products_product"."seller_id",
"products_product"."priority",
"products_product"."media",
"products_product"."active",
"products_product"."title",
"products_product"."slug",
"products_product"."description",
"products_product"."price",
"products_product"."sale_active",
"products_product"."sale_price",
"products_product"."timestamp",
"products_product"."updated",
"products_product"."draft",
"products_product"."hitcount",
"products_product"."finished",
"products_product"."is_marang_offline",
"products_product"."is_seller_beta_program",
COUNT("products_video"."id") AS "num_video"
FROM "products_product"
LEFT OUTER JOIN "products_video" ON ( "products_product"."id" = "products_video"."product_id" )
WHERE ("products_product"."draft" = false AND "products_product"."finished" = true)
GROUP BY
"products_product"."id",
"products_product"."seller_id",
"products_product"."priority",
"products_product"."media",
"products_product"."active",
"products_product"."title",
"products_product"."slug",
"products_product"."description",
"products_product"."price",
"products_product"."sale_active",
"products_product"."sale_price",
"products_product"."timestamp",
"products_product"."updated",
"products_product"."draft",
"products_product"."hitcount",
"products_product"."finished",
"products_product"."is_marang_offline",
"products_product"."is_seller_beta_program"
HAVING COUNT("products_video"."id") >= 8
ORDER BY "products_product"."priority" DESC, "products_product"."hitcount" DESC
LIMIT 100
I know it's an ugly query (generated by Django aggregation). In English, this query just means "give me a list of products that have at least 8 videos".
And here is the EXPLAIN output of this query:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=351.90..358.40 rows=100 width=933)
-> GroupAggregate (cost=351.90..364.06 rows=187 width=933)
Filter: (count(products_video.id) >= 8)
-> Sort (cost=351.90..352.37 rows=187 width=933)
Sort Key: products_product.priority, products_product.hitcount, products_product.id, products_product.seller_id, products_product.media, products_product.active, products_product.title, products_product.slug, products_product.description, products_product.price, products_product.sale_active, products_product.sale_price, products_product."timestamp", products_product.updated, products_product.draft, products_product.finished, products_product.is_marang_offline, products_product.is_seller_beta_program
-> Hash Right Join (cost=88.79..344.84 rows=187 width=933)
Hash Cond: (products_video.product_id = products_product.id)
-> Seq Scan on products_video (cost=0.00..245.41 rows=2341 width=8)
-> Hash (cost=88.26..88.26 rows=42 width=929)
-> Seq Scan on products_product (cost=0.00..88.26 rows=42 width=929)
Filter: ((NOT draft) AND finished)
(11 rows)
--- Update 8 June 2016 #2 ---
Since there are many suggestions from many people, I'll try to apply the fixes one by one and report back periodically.
@e4c5
Here's the information you need:
You can think of my site as somewhat like Udemy, an online course marketplace. There are "Products" (courses); each product contains a number of videos. Users can comment both on the Product page itself and on each Video.
In many cases, I'll need to query a list of products ordered by the TOTAL number of comments they got (the sum of comments on the Product itself AND comments on each Video of that Product).
The Django query that corresponds to the EXPLAIN output above:
all_products_exclude_draft = Product.objects.all().filter(draft=False)
products_that_contain_more_than_8_videos = all_products_exclude_draft.annotate(num_video=Count('video')).filter(finished=True, num_video__gte=8).order_by('timestamp')[:30]
I just noticed that I (or some other dev on my team) hit the database twice with these 2 Python lines.
Here are the Django models for Product and Video:
from django_model_changes import ChangesMixin

class Product(ChangesMixin, models.Model):
    class Meta:
        ordering = ['-priority', '-hitcount']

    seller = models.ForeignKey(SellerAccount)
    priority = models.PositiveSmallIntegerField(default=1)
    media = models.ImageField(blank=True,
                              null=True,
                              upload_to=download_media_location,
                              default=settings.MEDIA_ROOT + '/images/default_icon.png',
                              storage=FileSystemStorage(location=settings.MEDIA_ROOT))
    active = models.BooleanField(default=True)
    title = models.CharField(max_length=500)
    slug = models.SlugField(max_length=200, blank=True, unique=True)
    description = models.TextField()
    product_coin_price = models.IntegerField(default=0)
    sale_active = models.BooleanField(default=False)
    sale_price = models.IntegerField(default=0, null=True, blank=True)  # 100.00
    timestamp = models.DateTimeField(auto_now_add=True, auto_now=False, null=True)
    updated = models.DateTimeField(auto_now_add=False, auto_now=True, null=True)
    draft = models.BooleanField(default=True)
    hitcount = models.IntegerField(default=0)
    finished = models.BooleanField(default=False)
    is_marang_offline = models.BooleanField(default=False)
    is_seller_beta_program = models.BooleanField(default=False)

    def __unicode__(self):
        return self.title

    def get_avg_rating(self):
        rating_avg = self.productrating_set.aggregate(Avg("rating"), Count("rating"))
        return rating_avg

    def get_total_comment_count(self):
        comment_count = self.video_set.aggregate(Count("comment"))
        comment_count['comment__count'] += self.comment_set.count()
        return comment_count

    def get_total_hitcount(self):
        amount = self.hitcount
        for video in self.video_set.all():
            amount += video.hitcount
        return amount

    def get_absolute_url(self):
        view_name = "products:detail_slug"
        return reverse(view_name, kwargs={"slug": self.slug})

    def get_product_share_link(self):
        full_url = "%s%s" % (settings.FULL_DOMAIN_NAME, self.get_absolute_url())
        return full_url

    def get_edit_url(self):
        view_name = "sellers:product_edit"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_video_list_url(self):
        view_name = "sellers:video_list"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_product_delete_url(self):
        view_name = "products:product_delete"
        return reverse(view_name, kwargs={"pk": self.id})

    @property
    def get_price(self):
        if self.sale_price and self.sale_active:
            return self.sale_price
        return self.product_coin_price

    @property
    def video_count(self):
        videoCount = self.video_set.count()
        return videoCount
class Video(models.Model):
    seller = models.ForeignKey(SellerAccount)
    title = models.CharField(max_length=500)
    slug = models.SlugField(max_length=200, null=True, blank=True)
    story = models.TextField(default=" ")
    chapter_number = models.PositiveSmallIntegerField(default=1)
    active = models.BooleanField(default=True)
    featured = models.BooleanField(default=False)
    product = models.ForeignKey(Product, null=True)
    timestamp = models.DateTimeField(auto_now_add=True, auto_now=False, null=True)
    updated = models.DateTimeField(auto_now_add=False, auto_now=True, null=True)
    draft = models.BooleanField(default=True)
    hitcount = models.IntegerField(default=0)

    objects = VideoManager()

    class Meta:
        unique_together = ('slug', 'product')
        ordering = ['chapter_number', 'timestamp']

    def __unicode__(self):
        return self.title

    def get_comment_count(self):
        comment_count = self.comment_set.all_jing_jing().count()
        return comment_count

    def get_create_chapter_url(self):
        return reverse("sellers:video_create", kwargs={"pk": self.id})

    def get_edit_url(self):
        view_name = "sellers:video_update"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_video_delete_url(self):
        view_name = "products:video_delete"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_absolute_url(self):
        try:
            return reverse("products:video_detail", kwargs={"product_slug": self.product.slug, "pk": self.id})
        except:
            return "/"

    def get_video_share_link(self):
        full_url = "%s%s" % (settings.FULL_DOMAIN_NAME, self.get_absolute_url())
        return full_url

    def get_next_url(self):
        current_product = self.product
        videos = current_product.video_set.all().filter(chapter_number__gt=self.chapter_number)
        next_vid = None
        if len(videos) >= 1:
            try:
                next_vid = videos[0].get_absolute_url()
            except IndexError:
                next_vid = None
        return next_vid

    def get_previous_url(self):
        current_product = self.product
        videos = current_product.video_set.all().filter(chapter_number__lt=self.chapter_number).reverse()
        next_vid = None
        if len(videos) >= 1:
            try:
                next_vid = videos[0].get_absolute_url()
            except IndexError:
                next_vid = None
        return next_vid
And here are the indexes of the Product and Video tables, which I got from the command:
my_database_name=# \di
Note: the screenshot is photoshopped and includes some other models as well.
--- Update 8 June 2016 #3 ---
@Jerzyk
As you suspected, after inspecting all my code again I found that I indeed did 'slicing in memory': I tried to shuffle the first 10 results by doing this:
def get_queryset(self):
    all_product_list = Product.objects.all().filter(draft=False).annotate(
        num_video=Count(
            Case(
                When(
                    video__draft=False,
                    then=1,
                )
            )
        )
    ).order_by('-priority', '-num_video', '-hitcount')
    the_first_10_products = list(all_product_list[:10])
    the_11th_product_onwards = list(all_product_list[10:])
    random.shuffle(the_first_10_products)
    finalList = the_first_10_products + the_11th_product_onwards
    return finalList
Note: in the code above I need to count the number of Videos that are not in draft status.
So this will be one of the things I need to fix as well. Thanks. >_<
--- Here are the related screenshots ---
Postgres log when freezing occurs (log_min_duration = 500 milliseconds)
Postgres log (continued from the above screenshot)
Nginx error.log in the same time period
DigitalOcean CPU usage graph just before freezing
DigitalOcean CPU usage graph just after freezing
It's reasonable to conclude that your problems are caused by the slow query in question. By itself, each run of the query does not appear to be slow enough to cause timeouts; however, it's possible that several of these queries are executed concurrently, and that could lead to the meltdown. There are a few things you can do to speed things up.
1) Cache the result
The result of a long running query can be cached.
from django.core.cache import cache

def get_8x_videos():
    cache_key = 'products_videos_join'
    result = cache.get(cache_key, None)
    if not result:
        all_products_exclude_draft = Product.objects.all().filter(draft=False)
        # evaluate the queryset before caching it
        result = list(all_products_exclude_draft.annotate(
            num_video=Count('video')).filter(
            finished=True, num_video__gte=8).order_by('timestamp')[:30])
        cache.set(cache_key, result)
    return result
This result now comes from memcached (or whatever you use for caching), which means that if you have two hits for the page that uses this in quick succession, the second one will have no impact on the database. You can control how long the object stays cached by passing a timeout to cache.set().
2) Optimize the Query
The first thing that leaps out from the EXPLAIN is that you are doing sequential scans on both the products_product and products_video tables. Usually sequential scans are less desirable than index scans. However, an index scan may not be usable for this query because of the COUNT() and HAVING COUNT() clauses, as well as the massive GROUP BY clause.
update:
Your query has a LEFT OUTER JOIN; it's possible that an INNER JOIN or a subquery might be faster. To do that, we need to recognize that grouping the Video table on product_id can give us the set of products that have at least 8 videos.

from django.db.models.expressions import RawSQL

inner = RawSQL('SELECT product_id FROM products_video GROUP BY product_id HAVING COUNT(product_id) >= 8', params=[])
Product.objects.filter(id__in=inner)

The above eliminates the LEFT OUTER JOIN and introduces a subquery. However, this doesn't give easy access to the actual number of videos for each product, so this query in its present form may not be fully usable.
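A sketch of the same idea expressed purely in the ORM (using the models shown earlier: first aggregate the Video table on its own, then filter Product by the resulting ids):

from django.db.models import Count

# ids of products that have at least 8 videos, computed on the Video table only
product_ids = (Video.objects.values('product_id')
                            .annotate(n=Count('id'))
                            .filter(n__gte=8)
                            .values_list('product_id', flat=True))

products = Product.objects.filter(id__in=product_ids, draft=False, finished=True)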
3) Improving indexes
While it may be tempting to create an index on the draft and finished columns, this will be futile, as those columns do not have sufficient cardinality to be good candidates for indexes. However, it may still be possible to create a conditional (partial) index. Again, that conclusion can only be drawn after seeing your tables.
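For reference, a partial index in Postgres could be added with raw SQL, e.g. from a migration. This is only a sketch: the migration name and dependency are placeholders, and the index definition assumes the products_product table and the filter used by the query above:

# migrations/XXXX_add_partial_index.py -- hypothetical migration
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [('products', 'XXXX_previous_migration')]  # placeholder

    operations = [
        migrations.RunSQL(
            "CREATE INDEX products_product_finished_live_idx "
            "ON products_product (priority DESC, hitcount DESC) "
            "WHERE finished AND NOT draft;",
            reverse_sql="DROP INDEX products_product_finished_live_idx;",
        ),
    ]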
*** Update 7 June 2016: The issue occurred again. CPU hit 100% and stayed there. This answer does help with performance, but unfortunately it is not the solution to this problem.
Thanks to a recommendation by the DigitalOcean support team, I tried the configuration suggested by this tool:
http://pgtune.leopard.in.ua/
It recommended the following values for my droplet with 1 CPU core and 1GB RAM:
in postgresql.conf:
max_connections = 200
shared_buffers = 256MB
effective_cache_size = 768MB
work_mem = 1310kB
maintenance_work_mem = 64MB
checkpoint_segments = 32
checkpoint_completion_target = 0.7
wal_buffers = 7864kB
default_statistics_target = 100
/etc/sysctl.conf
kernel.shmmax=536870912
kernel.shmall=131072
Since then my Postgres server has been running fine for 3-4 days, so I assume this is the solution. Thanks everyone!

Django - What might cause the database to hang?

I have devised a local network website using the Django framework and recently ran into problems I was not having before.
We are running experiments on a local network collecting various measurements and I set up this website to make sure we are collecting all the data in the same place.
I set up a PostgreSQL database and use Django to populate it on the fly as I receive measurements. The script that does that looks like this:
**ladrLogger.py**
# various imports
import django
from django.db import IntegrityError

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")
django.setup()

from logger.models import Measurement, Device, Type, Room, Experiment, ExperimentData

def logDevice(self, port):
    # Callback function executed each time I receive data, to log it in the database
    deviceData = port.data  # get the data
    # Do a bunch of tests and checks
    # ....
    # Get all measurements to add to the database
    # measurements is a list of Measurement as defined in my django models
    measurements = self.prepareMeasurement(...)
    self.saveMeasurements(measurements)
    print "Saved measurements successfully."

def saveMeasurements(self, meas):
    if not meas:
        return
    elif type(meas) is list:
        for m in meas:
            self.saveMeasurements(m)
    elif type(meas) is Measurement:
        try:
            meas.save()
        except IntegrityError as e:
            if 'unique constraint' in e.message:
                print "Skipping... Measurement already existed for device " + meas.device.name
            else:
                print "Skipping measurement due to error: " + e.message

def prepareMeasurement(self, nameDevice, typeDevice, time, data):
    ### Takes the characteristics of a measurement (device, name and type) and creates the appropriate measurements.
    measurements = []
    m = Measurement()
    m.device = Device.objects.get(name=nameDevice)
    m.date = time
    # Bunch of tests
    # .....
    for idv, v in enumerate(value):
        if v in data:
            m = Measurement()
            m.device = something
            m.date = something else
            m.value = bla
            m.quantity = blabla
            measurements.append(m)
    return measurements
# Bunch of other methods
Note that this script is always running and waiting for more measurements to execute the logDevice callback.
EDIT: A custom library based on YARP takes care of the callback handling. The code to create the callbacks looks like this:
portid = self.createPort(quer.group(1), True, True)  # creates a port
pyarp.connect(desc[0], self.fullPortPath(portid))  # establishes a connection to the talking port
self.listenToPort(portid, lambda port: self.logDevice(port))  # tells it to execute that callback when it receives messages
Callbacks are entirely dealt with in the background.
On the other hand, I have my django website that has various views displaying devices, measurements, plotting and whatnot.
The problem I have is that I am logging measurements (a few (2-3) per second at peak times, usually fewer) and the logging seems to be fine. But when I call my views, for example asking for the latest measurement for device x, I get an old measurement. One example of code:
def latestTemp(request, device_id):
    # Creates a csv file with the latest temperature measured
    #### for now does not check what measurements are actually available
    dev = get_object_or_404(Device, pk=device_id)
    tz = pytz.timezone('Europe/Zurich')

    # Create the HttpResponse object with the appropriate CSV header.
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="%s.csv"' % dev.name

    # get measurement
    lastMeas = Measurement.objects.filter(device=dev, quantity=Type.objects.get(quantity='Temperature')).latest('date')

    writer = csv.writer(response)
    # Form list of required timesteps
    date = lastMeas.date.astimezone(tz)
    writer.writerow([date.strftime('%Y-%m-%d'), date.strftime('%H:%M'), lastMeas.value])
    return response
EDIT (more details):
I have been logging data for a few hours, but the website only shows me data dating back a few hours. As I keep asking for that data, it gets more and more recent, as if it had been buffered somewhere and was now slowly getting into place and becoming visible to the website, until everything finally comes back to normal. On the other hand, if I kill the logging process, the data seems lost forever. What is strange, though, is that the logDevice method completes and I can see that the meas.save() calls were executed. I also tried adding a listener for the Django post_save signal, and I catch the signals correctly.
Some information:
- I am using the postgresql backend
- I am running all of this on a dedicated Mac machine.
- let me know whatever else would be useful to know
My questions are:
- Do you see any reason this might happen? (It used to not happen, so I guess it might have to do with the database getting big, 4 GB right now.)
- As a side question, but maybe related: I suspect the way I am pushing new elements into the database is not really nice, since the code runs completely independently from the Django website itself. Any suggestions on how to improve? I thought the ladrLogger code could send a request to a dedicated view that creates the new element, but that might be heavier for no purpose.
EDIT: adding my models.py
class Room(models.Model):
    fullName = models.CharField(max_length=20, unique=True)
    shortName = models.CharField(max_length=5, unique=True, default="000")
    nickName = models.CharField(max_length=20, default="Random Room")

    def __unicode__(self):
        return self.fullName

class Type(models.Model):
    quantity = models.CharField(max_length=100, default="Temperature", unique=True)
    unit = models.CharField(max_length=5, default="C", blank=True)
    VALUE_TYPES = (
        ('float', 'float'),
        ('boolean', 'boolean'),
        ('integer', 'integer'),
        ('string', 'string'),
    )
    value_type = models.CharField(max_length=20, choices=VALUE_TYPES, default="float")

    def __unicode__(self):
        return self.quantity

class Device(models.Model):
    name = models.CharField(max_length=30, default="Unidentified Device", unique=True)
    room = models.ForeignKey(Room)
    description = models.CharField(max_length=500, default="", blank=True)
    indigoId = models.CharField(max_length=30, default="000")

    def __unicode__(self):
        # r = Room.objects.get(pk=self.room)
        return self.name  # + ' in room ' + r.name

    def latestMeasurement(self, *args):
        if len(args) == 0:
            # No argument so just return latest argument
            meas = Measurement.objects.filter(device=self).latest('date')
        else:
            # Use first argument as the type
            meas = Measurement.objects.filter(device=self, quantity=args[0]).latest('date')
        if not meas:
            return None
        else:
            return meas

    def typeList(self):
        return Type.objects.filter(measurement__device=self).distinct()

class Measurement(models.Model):
    device = models.ForeignKey(Device)
    date = models.DateTimeField(db_index=True)
    value = models.CharField(max_length=100, default="")
    quantity = models.ForeignKey(Type)

    class Meta:
        unique_together = ('date', 'device', 'quantity',)
        index_together = ['date', 'device']

    def __unicode__(self):
        t = self.quantity
        return str(self.value) + " " + self.quantity.unit
        # return str(self.value)