Bulk updating a table - django

I want to update a customer table with a spreadsheet from our accounting system. Unfortunately I can't just clear out the data and reload all of it, because there are a few records in the table that are not in the imported data (don't ask).
For 2000 records this is taking about 5 minutes, and I wondered if there was a better way of doing it.
for row in data:
    try:
        try:
            customer = models.Retailer.objects.get(shared_id=row['Customer'])
        except models.Retailer.DoesNotExist:
            customer = models.Retailer()
            customer.shared_id = row['Customer']
        customer.name = row['Name 1']
        customer.address01 = row['Street']
        customer.address02 = row['Street 2']
        customer.postcode = row['Postl Code']
        customer.city = row['City']
        customer.save()
    except Exception:
        print(formatExceptionInfo("Error with Customer ID: " + str(row['Customer'])))

See my answer to "Django: form that updates X amount of models".
A QuerySet has an update() method; the rest is explained in that answer.
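The loop in the question issues a lookup and a separately committed write for every row, which is likely where much of the time goes. A minimal sketch of the alternative, using the standard-library sqlite3 module rather than the Django ORM so that it runs on its own: read all existing keys in one query, split the spreadsheet rows into inserts and updates, and write both batches inside a single transaction. The retailer table, its columns, and the sync_customers helper are assumptions made up for illustration.

```python
import sqlite3


def sync_customers(conn, rows):
    """Insert new customers and update existing ones in a single transaction.

    `rows` is a list of dicts keyed like the spreadsheet in the question.
    Records already in the table but missing from `rows` are left untouched.
    Returns (number_inserted, number_updated).
    """
    # One query fetches every existing key, instead of one SELECT per row.
    existing = {r[0] for r in conn.execute("SELECT shared_id FROM retailer")}

    to_insert, to_update = [], []
    for row in rows:
        values = (row["Name 1"], row["Street"], row["Customer"])
        (to_update if row["Customer"] in existing else to_insert).append(values)

    # `with conn` commits once at the end, or rolls back if anything fails.
    with conn:
        conn.executemany(
            "INSERT INTO retailer (name, address01, shared_id) VALUES (?, ?, ?)",
            to_insert,
        )
        conn.executemany(
            "UPDATE retailer SET name = ?, address01 = ? WHERE shared_id = ?",
            to_update,
        )
    return len(to_insert), len(to_update)


# Hypothetical table and data, only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE retailer (shared_id TEXT PRIMARY KEY, name TEXT, address01 TEXT)"
)
conn.execute("INSERT INTO retailer VALUES ('100', 'Old name', 'Old street')")
conn.execute("INSERT INTO retailer VALUES ('999', 'Not in import', 'Kept as is')")
conn.commit()

data = [
    {"Customer": "100", "Name 1": "Acme", "Street": "1 Main St"},
    {"Customer": "200", "Name 1": "Bolt", "Street": "2 Side St"},
]
inserted, updated = sync_customers(conn, data)
```

In Django the same shape would probably be a transaction.atomic() block around QuerySet.bulk_create() for the new rows and, on Django 2.2 or later, bulk_update() for the existing ones.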

I've had some success using this bulk update snippet:
http://djangosnippets.org/snippets/446/
It's a bit outdated, but it worked on Django 1.1, so I suppose you can still make it work. If you are looking for a quick way to do a one-time bulk insert, this is the quickest (I'm not sure I'd trust it for regular use without seriously testing performance).

I've made a terribly crude attempt at a solution for this problem, but it's not finished yet and it doesn't support working with Django ORM objects directly yet.
http://pypi.python.org/pypi/dse/0.1.0
It has not been properly tested, so let me know if you have any suggestions on how to improve it. Using the Django ORM to do stuff like this is terrible.
Thomas

Related

Nested SQL queries in Django

I've got a working SQL query that I'm trying to write in Django (without resorting to RAW) and was hoping you might be able to help.
Broadly, I'm looking to nest two queries: the first calculates a COUNT, and then I'm looking to calculate an AVERAGE of the COUNTs. (This will give you the average number of items on a ticket, per location.)
The SQL that works is:
SELECT location_name, Avg(subq.num_tickets) FROM (
    SELECT Count(ticketitem.id) AS num_tickets, location.name AS location_name
    FROM ticketitem
    JOIN ticket ON ticket.id = ticketitem.ticket_id
    JOIN location ON location.id = ticket.location_id
    JOIN location ON location.id = location.app_location_id
    GROUP BY ticket_id, location.name) AS subq
GROUP BY subq.location_name;
For my Django code, I'm trying something like this:
# Get the first count
qs = TicketItem.objects.filter(<my complicated filter>).\
    values('ticket__location__app_location__name','posticket').\
    annotate(num_tickets=Count('id'))
# now get the average of the count
qs2 = qs.values('ticket__location__app_location__name').\
    annotate(Avg('num_tickets')).\
    order_by('location__app_location__name')
but that fails because num_tickets doesn't exist ... Anyway - suspect I'm being slow. Would love someone to enlighten me!

Check out the section on aggregating annotations from the Django docs. Their example takes an average of a count.

I was playing around with this a bit in a manage.py shell, and I think the django ORM might not be able to do that kind of annotation. Honestly you're probably going to have to resort to doing a raw query or bind in something like https://github.com/Deepwalker/aldjemy which would let you do that via SQLAlchemy.
When I was playing with this, I tried
(my_model.objects.filter(...)
    .values('parent_id', 'parent__name', 'thing')
    .annotate(Count('thing'))
    .values('name', 'thing__count')
    .annotate(Avg('thing__count')))
Which gave a lovely traceback about FieldError: Cannot compute Avg('thing__count'): 'thing__count' is an aggregate, which makes sense since I doubt the ORM is trying to convert that first group by to a nested query.
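If the ORM will not nest the two GROUP BY steps, one fallback is to let the database do the first aggregation (the per-ticket count) and compute the per-location average in Python. A minimal sketch, assuming the first queryset yields rows shaped like the hypothetical dicts below (the key names and numbers are made up, not taken from the question's schema):

```python
from collections import defaultdict


def average_items_per_ticket(count_rows):
    """Average the per-ticket counts for each location.

    `count_rows` mimics the output of a values(...).annotate(Count(...))
    query: one dict per (location, ticket) pair with its item count.
    """
    totals = defaultdict(lambda: [0, 0])  # location -> [sum of counts, tickets]
    for row in count_rows:
        entry = totals[row["location_name"]]
        entry[0] += row["num_tickets"]
        entry[1] += 1
    return {
        name: float(total) / tickets
        for name, (total, tickets) in totals.items()
    }


# Hypothetical rows standing in for the first (count) query.
count_rows = [
    {"location_name": "North", "ticket_id": 1, "num_tickets": 2},
    {"location_name": "North", "ticket_id": 2, "num_tickets": 4},
    {"location_name": "South", "ticket_id": 3, "num_tickets": 5},
]
averages = average_items_per_ticket(count_rows)
```

This moves one row per ticket into Python, so it only makes sense while the number of tickets stays manageable; beyond that, a raw query is probably the better route.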

Django Query without referencing fields

Quick overview:
I have a Forecast model setup which has a workflow_state.
Now I'm trying to query the Forecast for all the forecasts that are in a certain state AND the current person logged in is_staff.
If I were writing a raw query this wouldn't be an issue, because I could write something like:
SELECT * FROM forecast WHERE forecast.workflow_state_id in (1,2,3,4) AND 1 = user.is_staff
However, when trying to write this in a queryset I can't figure out how to reference a constant. I don't want to write a raw queryset and if possible want to avoid using the extra field.
Any suggestions would be appreciated. Thanks

Your constant is simply True:
Forecast.objects.filter(workflow_state__in=[1,2,3,4], user__is_staff=True)

Your edit makes things rather less than clear, but you seem to be asking how to do a check on the current logged-in user, rather than on the user referenced by the model. In which case, you don't do that in a query at all; your example SQL statement wouldn't work, and neither would doing it in the ORM. You do it in Python, of course:
if request.user.is_staff:
    forecasts = Forecast.objects.filter(workflow_state__in=[1,2,3,4])

how to get latest foreign key value in models.py

I have a little problem with getting latest foreign key value in my django app. Here are my two models:
class Stock(models.Model):
    ...
class Dividend(models.Model):
    date = models.DateField('pay date')
    stock = models.ForeignKey(Stock, related_name="dividends")
    class Meta:
        ordering = ["date"]
I would like to get the latest dividend from a stock object, so basically stock.dividends.latest('date'). However, every time I call stock.dividends.latest('date'), it fires an SQL query to get the latest dividend. I call latest() in a for loop over every stock I have, and I would like to avoid these SQL queries. Can I somehow define a new method on the Stock class that would get the latest dividend within the SQL query for the stock object?
I cannot change default ordering from "date" to "-date".
Using select_related('dividends') loads dividends objects with stock, but latest probably uses order_by and it requires sql query anyway. :(
EDIT1: To make more clear what I want, here is an example. Let's say I have 100 symbols in shares.keys():
for stock in Stock.objects.filter(symbol__in=shares.keys()):  # 1 sql query
    latest_dividend = stock.dividends.latest('date')  # 100 sql queries
    ...  # do something with latest dividend
Well and in some cases I might have 500 symbols in shares.keys(). That is why I need to avoid making sql queries on getting latest dividend for stock.

I had the same problem as you, so I tested many Django queries. Finally, I found out that we can use this:
Stock.objects.all().annotate(latest_date=Max('dividends__date')).filter(dividends__date=F('latest_date')).values('dividends')

I'm not sure my solution is the best, but here it is (works only with PostgreSQL):
stocks = list(Stock.objects.filter(**something))
dividends = Dividend.objects.filter(
    stock__in=stocks,
).order_by(
    'stock_id',
    '-date'
).distinct(
    'stock_id',
)
dividends_dict = {d.stock_id: d for d in dividends}
for stock in stocks:
    stock.latest_dividend = dividends_dict.get(stock.id)
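distinct() with field names is only supported on PostgreSQL. On other backends a similar dictionary can be built in Python from a single dividends query, keeping only the newest row per stock. A minimal sketch with plain tuples standing in for Dividend objects (the DividendRow type, the latest_by_stock helper, and the sample data are hypothetical):

```python
from collections import namedtuple
from datetime import date

# Stand-in for a Dividend model instance.
DividendRow = namedtuple("DividendRow", ["stock_id", "date", "amount"])


def latest_by_stock(dividends):
    """Return {stock_id: newest dividend} in one pass, without sorting."""
    latest = {}
    for dividend in dividends:
        current = latest.get(dividend.stock_id)
        if current is None or dividend.date > current.date:
            latest[dividend.stock_id] = dividend
    return latest


dividends = [
    DividendRow(1, date(2013, 3, 1), 0.50),
    DividendRow(1, date(2013, 9, 1), 0.55),
    DividendRow(2, date(2012, 12, 15), 1.10),
]
dividends_dict = latest_by_stock(dividends)
```

The trade-off is that every dividend row for the selected stocks is fetched, whereas the PostgreSQL query above returns only one row per stock.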

I'm a little confused by your question. I'm assuming you are trying to access the dividends from your stock object in order to limit your queries to the database. I believe the following uses the smallest possible number of queries.
stock = Stock.objects.get(pk=your_query)
latest_dividends = stock.dividends.order_by('-date')[:5]

likeon: Thanks for your answer. I think I can avoid initializing that large dictionary (I have 5000 stocks and 280 000 dividends), but your list gave me an idea. Your code requires 2 SQL queries. Here is my version of the example from EDIT1:
for stock in Stock.objects.filter(symbol__in=shares.keys())\
        .prefetch_related('dividends'):  # 2 sql queries
    # dividends are prefetched in the model's default order (by date), so the last one is the latest
    latest_dividend = list(stock.dividends.all())[-1]  # 0 sql queries
    ...  # do something with latest_dividend
My code also requires 2 SQL queries, but I do not have to reorder anything or build a list from the stocks and all 280 000 dividends (I only build a list from the current stock's dividends on each iteration). Maybe creating one dict is quicker than creating len(shares.keys()) lists; I'm not sure.
I thought there would be an easier solution (one that avoids creating a list or dictionary from the dividends), but this is good enough for now. Thanks for the answers!

As far as I understand, you can do it this way:
stock.dividends.last()
because first() is implemented in Django like this, and last() mirrors it with the ordering reversed:
def first(self):
    """Return the first object of a query or None if no match is found."""
    for obj in (self if self.ordered else self.order_by('pk'))[:1]:
        return obj
You can also use .latest(*fields, field_name=None).

Get most commented posts in django with django comments

I'm trying to get the ten most commented posts in my Django app, but I can't think of a proper way to do it.
I'm currently using the Django comments framework, and I've seen that this might be possible with aggregate or annotate, but I can't figure out how.
The steps would be:
Get all the posts
Calculate the number of comments per post (I have a comment_count method for that)
Order the posts from most commented to least
Get the first 10 (for example)
Is there any "simple" or "pythonic" way to do this? I'm a bit lost since the comments framework is only accessible via template tags, and not directly from the code (unless you want to modify it).
Any help is appreciated.

You're right that you need to use the annotation and aggregation features. What you need to do is group by and get a count of the object_pk of the Comment model:
from django.contrib.comments.models import Comment
from django.db.models import Count
o_list = Comment.objects.values('object_pk').annotate(ocount=Count('object_pk'))
This will assign something like the following to o_list:
[{'object_pk': '123', 'ocount': 56},
{'object_pk': '321', 'ocount': 47},
...etc...]
You could then sort the list and slice the top 10:
top_ten_objects = sorted(o_list, key=lambda k: k['ocount'], reverse=True)[:10]
You can then use the values in object_pk to retrieve the objects that the comments are attached to.

Annotate is going to be the preferred way, partly because it reduces db queries and it's basically a one-liner. While your theoretical loop would work, I bet your comment_count method relies on querying comments for a given post, which would be one query per post that you loop over. Nasty!
posts_by_score = Comment.objects.filter(is_public=True).values('object_pk').annotate(
    score=Count('id')).order_by('-score')
post_ids = [int(obj['object_pk']) for obj in posts_by_score]
top_posts = Post.objects.in_bulk(post_ids)
This code is shamelessly adapted from Django-Blog-Zinnia (no affiliation).
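Two details are worth handling explicitly when turning the counts into a top ten: the ranking has to be descending, and in_bulk() returns a dict keyed by primary key, so the ranking order has to be reapplied afterwards. A minimal sketch with plain dicts standing in for the queryset rows and for the in_bulk() result (all sample values are made up, and only the top two are taken to keep it short):

```python
import heapq

# Hypothetical rows, shaped like the values('object_pk').annotate(score=...) output.
comment_counts = [
    {"object_pk": "7", "score": 3},
    {"object_pk": "2", "score": 56},
    {"object_pk": "5", "score": 47},
]

# Take the N highest scores; nlargest returns them in descending order.
top = heapq.nlargest(2, comment_counts, key=lambda row: row["score"])
post_ids = [int(row["object_pk"]) for row in top]

# Stand-in for Post.objects.in_bulk(post_ids), which maps pk -> object.
posts_by_id = {5: "Post five", 2: "Post two"}

# Re-apply the ranking, skipping ids whose post no longer exists.
top_posts = [posts_by_id[pk] for pk in post_ids if pk in posts_by_id]
```

With a real queryset, slicing the ordered query (for example [:10] after order_by('-score')) lets the database apply the limit, so heapq would not be needed.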

fast lookup for the last element in a Django QuerySet?

I've a model called Valor. Valor has a Robot. I'm querying like this:
Valor.objects.filter(robot=r).reverse()[0]
to get the last Valor for the robot r. Valor.objects.filter(robot=r).count() is about 200000, and getting the last item takes about 4 seconds on my PC.
How can I speed it up? Am I querying the wrong way?

The optimal mysql syntax for this problem would be something along the lines of:
SELECT * FROM table WHERE x=y ORDER BY z DESC LIMIT 1
The django equivalent of this would be:
Valor.objects.filter(robot=r).order_by('-id')[:1][0]
Notice how this solution utilizes django's slicing method to limit the queryset before compiling the list of objects.

If none of the earlier suggestions are working, I'd suggest taking Django out of the equation and running this raw SQL against your database. I'm guessing at your table names, so you may have to adjust accordingly:
SELECT * FROM valor v WHERE v.robot_id = [robot_id] ORDER BY id DESC LIMIT 1;
Is that slow? If so, make your RDBMS (MySQL?) explain the query plan to you. This will tell you if it's doing any full table scans, which you obviously don't want with a table that large. You might also edit your question and include the schema for the valor table for us to see.
Also, you can see the SQL that Django is generating by doing this (using the query set provided by Peter Rowell):
# slice with [:1] rather than [0]: [0] returns a model instance, which has no .query
qs = Valor.objects.filter(robot=r).order_by('-id')[:1]
print(qs.query)
Make sure that SQL is similar to the 'raw' query I posted above. You can also make your RDBMS explain that query plan to you.
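The same check can be tried in isolation. A minimal sketch using the standard-library sqlite3 module as a stand-in for the real database: it builds a hypothetical valor table with an index on robot_id, runs the ORDER BY ... LIMIT 1 query, and asks SQLite for its query plan. The table layout and index name are assumptions; MySQL and PostgreSQL use EXPLAIN rather than EXPLAIN QUERY PLAN, and their output looks different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE valor (id INTEGER PRIMARY KEY, robot_id INTEGER, reading REAL)"
)
# Index on the filtered column, so the lookup need not scan the whole table.
conn.execute("CREATE INDEX idx_valor_robot ON valor (robot_id)")
conn.executemany(
    "INSERT INTO valor (robot_id, reading) VALUES (?, ?)",
    [(n % 3, n * 0.5) for n in range(1, 301)],
)

query = "SELECT id, reading FROM valor WHERE robot_id = ? ORDER BY id DESC LIMIT 1"
last_row = conn.execute(query, (1,)).fetchone()

# The last column of each plan row is a human-readable description of a step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (1,))]
print(last_row)
print(plan)
```

A plan step that mentions a full scan of the table, rather than a search through an index, is the kind of thing to look for in the real database.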

It sounds like your data set is going to be big enough that you may want to denormalize things a little bit. Have you tried keeping track of the last Valor object in the Robot object?
class Robot(models.Model):
    # ...
    last_valor = models.ForeignKey('Valor', null=True, blank=True)
And then use a post_save signal to make the update.
from django.db.models.signals import post_save
def record_last_valor(sender, **kwargs):
    if kwargs.get('created', False):
        instance = kwargs.get('instance')
        instance.robot.last_valor = instance
        instance.robot.save()  # persist the pointer; the assignment alone is not written to the db
post_save.connect(record_last_valor, sender=Valor)
You will pay the cost of an extra db transaction when you create the Valor objects but the last_valor lookup will be blazing fast. Play with it and see if the tradeoff is worth it for your app.

Well, there's no order_by clause so I'm wondering about what you mean by 'last'. Assuming you meant 'last added',
Valor.objects.filter(robot=r).order_by('-id')[0]
might do the job for you.

Django 1.6 introduced .first() and .last():
https://docs.djangoproject.com/en/1.6/ref/models/querysets/#last
So you could simply do:
Valor.objects.filter(robot=r).last()

Quite fast should also be:
qs = Valor.objects.filter(robot=r) # <-- it doesn't hit the database
count = qs.count() # <-- first hit the database, compute a count
last_item = qs[ count-1 ] # <-- second hit the database, get specified rownum
So, in practice you execute only 2 SQL queries ;)

Model_Name.objects.first()  # get the first element
Model_Name.objects.last()  # get the last element
In my case, last() did not help because there was only one row in the database.
Maybe this is helpful for you too :)

Is there a limit clause in django? This way you can have the db, simply return a single record.
mysql
select * from table where x = y limit 1
sql server
select top 1 * from table where x = y
oracle
select * from table where x = y and rownum = 1
I realize this isn't translated into django, but someone can come back and clean this up.

The correct way of doing this is to use the built-in QuerySet method latest() and feed it whichever column (field name) it should sort by. The drawback is that it can only sort by a single db column.
The current implementation looks like this and is optimized in the same sense as @Aaron's suggestion.
def latest(self, field_name=None):
    """
    Returns the latest object, according to the model's 'get_latest_by'
    option or optional given field_name.
    """
    latest_by = field_name or self.model._meta.get_latest_by
    assert bool(latest_by), "latest() requires either a field_name parameter or 'get_latest_by' in the model"
    assert self.query.can_filter(), \
        "Cannot change a query once a slice has been taken."
    obj = self._clone()
    obj.query.set_limits(high=1)
    obj.query.clear_ordering()
    obj.query.add_ordering('-%s' % latest_by)
    return obj.get()
