I am working on a project where I need to insert large files into models (sometimes several gigabytes). Because the files can be large, the approach I am taking is to read them line by line and insert each row into the Django model.
However, when an error is encountered partway through, how do I cancel the whole operation? What is the proper way to make sure that the rows are committed only after the entire file is processed without errors?
The other alternative is to create all the model objects in one go and insert them in bulk. Is this feasible for large datasets, and how would it work?
Here's my code:
class mymodel(models.Model):
    fkey1 = models.ForeignKey(othermodel1, on_delete=models.CASCADE)
    fkey2 = models.ForeignKey(othermodel2, on_delete=models.CASCADE)
    field1 = models.CharField(max_length=25, blank=False)
    field2 = models.DateField(blank=False)
    ...
    field12 = models.FloatField(blank=False)
And here is how I insert data into the model from an Excel file:
wb = load_workbook(datafile, read_only=True, data_only=True)
ws = wb.get_sheet_by_name(sheetName)
for row in ws.rows:
    if isthisheaderrow(row):
        # determine column arrangement and pass to next
        break
for row in ws.rows:
    if isthisheaderrow(row):
        pass
    elif isThisValidDataRow(row):
        relevantRow = <create a list of values>
        dictionary = dict(zip(columnNames, relevantRow))
        dictionary['fkey1'] = othermodel1Object
        dictionary['fkey2'] = othermodel2Object
        mymodel(**dictionary).save()
I should have looked harder: the commit can be delayed with the @transaction.atomic decorator, which commits everything inside the decorated function as one transaction and rolls it all back if an exception escapes. A more detailed description is given here: https://docs.djangoproject.com/en/1.11/topics/db/transactions/
The code being:
from django.db import transaction

wb = load_workbook(trfile, read_only=True, data_only=True)
ws = wb.get_sheet_by_name(sheetName)
revenueSwitch = True
for row in ws.rows:
    selectedIndex = ifHeaderReturnIndex(row, desiredColumns)
    if selectedIndex:
        outputColumnNames = [row[i].value.replace(" ", "") for i in selectedIndex]
        # output_ws.append(outputColumnNames)
        break

@transaction.atomic
def insertrows():
    for row in ws.rows:
        if ifHeaderReturnIndex(row, desiredColumns):
            pass
        elif isRowValid(row, selectedIndex):
            newrow = [row[i].value for i in selectedIndex]
            dictionary = dict(zip(outputColumnNames, newrow))
            dictionary['UniqueRunID'] = run
            dictionary['SourceFileObject'] = TrFile
            TransactionData(**dictionary).save()

insertrows()
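As for the bulk alternative from the question: a sketch of the same loop collecting instances and flushing them with bulk_create in fixed-size batches, all inside one atomic block (helper names are taken from the code above). Note that bulk_create skips the model's save() method and signals.

def insertrows_bulk(batch_size=1000):
    batch = []
    with transaction.atomic():  # any error rolls back every batch
        for row in ws.rows:
            if ifHeaderReturnIndex(row, desiredColumns):
                continue
            if isRowValid(row, selectedIndex):
                newrow = [row[i].value for i in selectedIndex]
                dictionary = dict(zip(outputColumnNames, newrow))
                dictionary['UniqueRunID'] = run
                dictionary['SourceFileObject'] = TrFile
                batch.append(TransactionData(**dictionary))
            if len(batch) >= batch_size:
                TransactionData.objects.bulk_create(batch)  # one INSERT per batch
                batch = []
        if batch:
            TransactionData.objects.bulk_create(batch)

insertrows_bulk()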
Suppose we have a model in Django defined as follows:
class DateClass(models.Model):
    user_id = models.IntegerField(...)
    sp_date = models.DateField(...)
    is_active = models.BooleanField(...)
    ...
I follow an insert-only policy here, i.e., for a specific user there will be only one active row per date. That means there will be only one active row for user=1 in the date table for sp_date values 27/10/2021, 28/10/2021 and so on. There shouldn't be two active rows for 27/10/2021 for user=1, but other users can have their own rows for 27/10/2021. Whenever a date has to be updated, I deactivate (is_active=False) the previous row and add a new row for that date.
I want to find duplicate active dates for each user in one single query, and then deactivate (set is_active=False) all the duplicate values except the last row (the row which was last inserted). Two rows are duplicates if their user_id and sp_date values are equal and both have is_active=True. I know how to find duplicates for a specific column, which is fairly easy, but I can't think of something that can do the above task elegantly. I can only think of the following approach:
for user in users:
    dates = DateClass.objects.filter(user_id=user.id, is_active=True)
    for date in dates:
        days = dates.filter(sp_date=date.sp_date, is_active=True)
        if days.count() > 1:
            last_day = days.last()
            days.exclude(id=last_day.id).update(is_active=False)
As you can see, the above is not that efficient, as I have to loop through all users. Is there any way to do this more efficiently? I am using PostgreSQL for the database.
There's a great answer for a multiple-duplicate-fields queryset in this answer; I don't want to take the credit or reinvent the wheel, so I will point you to it.
For your case it should be:
from django.db.models import Count

duplicate_date_class = (
    DateClass.objects.filter(is_active=True)
    .values('user_id', 'sp_date')
    .annotate(records=Count('user_id'))
    .filter(records__gt=1)
)

# Then do operations on duplicates
for date_class in duplicate_date_class:
    days = DateClass.objects.filter(
        user_id=date_class['user_id'],
        sp_date=date_class['sp_date'],
        is_active=True,
    ).order_by('pk')
    # keep the last-inserted row; a sliced queryset cannot be .update()d
    # directly, so collect the ids to deactivate first
    stale_ids = list(days.values_list('pk', flat=True))[:-1]
    DateClass.objects.filter(pk__in=stale_ids).update(is_active=False)
If you want to avoid having duplicate sets of multiple fields in the first place, I suggest taking a look at unique_together for model validation.
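A loop-free variant is also possible (a sketch, not a tested drop-in): on PostgreSQL you can deactivate everything except the newest active row per (user_id, sp_date) group with a single UPDATE driven by a correlated subquery.

from django.db.models import OuterRef, Subquery

# id of the most recently inserted active row in each (user_id, sp_date) group
latest = (
    DateClass.objects
    .filter(user_id=OuterRef('user_id'), sp_date=OuterRef('sp_date'), is_active=True)
    .order_by('-id')
    .values('id')[:1]
)

# deactivate every active row that is not the newest one in its group
DateClass.objects.filter(is_active=True).exclude(
    id=Subquery(latest)
).update(is_active=False)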
Been searching the web for a couple of hours now looking for a solution, but nothing quite fits what I am looking for.
I have one model (simplified):
class SimpleModel(Model):
    name = CharField('Name', max_length=100, unique=True)  # max_length is required
    date = DateField()
    amount = FloatField()
I have two dates: date_one and date_two.
I would like a single queryset with a row for each name in the Model, with each row showing:
{'name': name, 'date_one': date_one, 'date_two': date_two, 'amount_one': amount_one, 'amount_two': amount_two, 'change': amount_two - amount_one}
Reason being I would like to be able to find the rank of amount_one, amount_two, and change, using sort or filters on that single queryset.
I know I could create a list of dictionaries from two separate querysets, then sort on that and get the ranks from the index values ...
but perhaps naively I feel like there should be a DB solution using one queryset that would be faster.
union seemed promising, but you cannot perform some simple operations like filter after it.
I think I could perhaps split name into its own model and generate a queryset with related fields, but I'd prefer not to change the schema at this stage. Also, I only have access to SQLite.
I'd appreciate any help!
Your current model forces you to have ONE name associated with ONE date and ONE amount. Because name is unique=True, you literally cannot have two dates associated with the same name.
So if you want to be able to have several dates/amounts associated with a name, there are several ways to proceed:
Idea 1: If there will only ever be 2 dates and 2 amounts, simply add a second date field and a second amount field.
Idea 2: If there can be any number of dates and amounts, you'll have to change your model to reflect that, by having:
A model for your names
A model for your dates and amounts, with a foreign key to your names
Idea 3: You could keep the same model and simply remove the unique constraint, but that's a recipe for mistakes.
Based on your choice, you'll then have several ways of querying what you need; it depends on your final model structure. The best way to go would be to create custom model methods that query the two dates/amounts, format an array, and return it.
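To make Idea 2 concrete, here is a sketch under assumed models (a Name model plus an Entry model with a foreign key back to it; date_one and date_two come from the question). Conditional aggregation pivots the two dates into columns, works on SQLite, and yields one queryset you can sort or filter for ranks:

from django.db import models
from django.db.models import Case, F, FloatField, Max, When

class Name(models.Model):
    name = models.CharField(max_length=100, unique=True)

class Entry(models.Model):
    name = models.ForeignKey(Name, related_name='entries', on_delete=models.CASCADE)
    date = models.DateField()
    amount = models.FloatField()

# one row per name, with both amounts and the change, computed in the DB
qs = (
    Name.objects
    .annotate(
        amount_one=Max(Case(When(entries__date=date_one, then=F('entries__amount')),
                            output_field=FloatField())),
        amount_two=Max(Case(When(entries__date=date_two, then=F('entries__amount')),
                            output_field=FloatField())),
    )
    .annotate(change=F('amount_two') - F('amount_one'))
    .order_by('-change')  # or '-amount_one' / '-amount_two' for the other ranks
)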
These are simplified models to demonstrate my problem:
class User(models.Model):
    username = models.CharField(max_length=30)
    total_readers = models.IntegerField(default=0)

class Book(models.Model):
    author = models.ForeignKey(User)
    title = models.CharField(max_length=100)

class Reader(models.Model):
    user = models.ForeignKey(User)
    book = models.ForeignKey(Book)
So, we have Users, Books and Readers (Users who have read a Book). Thus, Reader is basically a many-to-many relationship between Book and User.
Now let's say, the current user reads a book. Now, I'd like to update the number of total readers for all books of this book's author:
# get the book (as an example pk=1)
book = Book.objects.get(pk=1)
# save Reader object for this user and this book
Reader(user=request.user, book=book).save()
# count and save the total number of readers for this author in all his books
book.author.total_readers = Reader.objects.filter(book__author=book.author).count()
book.author.save()
By doing so, Django creates a LEFT OUTER JOIN query for PostgreSQL and we get the expected result. However, the database tables are huge and this has become a bottleneck.
In this example, we could simply increase the total_readers by one on each view, instead of actually counting the database rows. However, this is just a simplified model structure and we cannot do this in reality here.
What I can do, is creating another field in the Reader model called book_author_id. Thus, I denormalize data and can count the Reader objects without having PostgreSQL making the LEFT OUTER JOIN with the User table.
Finally, here's my question: Is it possible to create some sort of database index, so that PostgreSQL handles this denormalization automatically? Or do I really have to create this additional model field and redundantly store the author's PK in there?
EDIT - to point out the essential question: I got several great answers, which work for a lot of scenarios. However, they don't solve this actual problem. The only thing I'd like to know, is if it's possible to have PostgreSQL handle such a denormalization automatically - e.g. by creating some sort of database index.
Sometimes this query can perform better:
book.author.total_readers = Reader.objects.filter(book__in=Book.objects.filter(author=book.author)).count()
That will generate a query with a sub-query, which can sometimes perform better than a query with a join. You can go even further and end up issuing 2 separate queries:
book.author.total_readers = Reader.objects.filter(book_id__in=list(Book.objects.filter(author=book.author).values_list('id', flat=True))).count()
That runs 2 queries: the list() call retrieves the IDs of all books by that author, and the second query counts the Reader rows whose book ID is in that list. (Without the list() call, Django keeps the inner queryset as a sub-query in a single SQL statement.)
A good solution may also be to create a batch task that runs, for example, once per hour and recounts all reads, but that way the count is not refreshed live.
You can also create a Celery task that runs just after a Reader is created and generates the new value for the author. That way you won't add response time, and the delay between creating a read and counting it stays short.
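A minimal sketch of such a task, assuming a working Celery setup (the task name is made up); you would call refresh_total_readers.delay(book.author_id) right after saving the Reader:

from celery import shared_task

@shared_task
def refresh_total_readers(author_id):
    # recount in the background, outside the request/response cycle
    total = Reader.objects.filter(book__author_id=author_id).count()
    User.objects.filter(pk=author_id).update(total_readers=total)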
It's always way better to solve bottlenecks of this sort with good design and maybe a little caching than by duplicating data the way you suggest. The total_readers field is data you should generate, not record.
class User(models.Model):
    username = models.CharField(max_length=30)

    @property
    def total_readers(self):
        cached_value = caching_client.get("readers_" + self.username, None)
        if cached_value is None:
            cached_value = self.readers()
            caching_client.set("readers_" + self.username, cached_value)
        return cached_value

    def readers(self):
        return Reader.objects.filter(book__author=self).count()
There are libraries that do the caching via decorators, but I felt this was a pattern you would benefit from seeing written out explicitly. You can also attach a TTL to the cache entry so that the value cannot be wrong for longer than the TTL, as sketched below. You can also regenerate the cache upon creation of a Reader object.
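With Django's cache framework that pattern plus a TTL could look roughly like this (the helper name is made up, and a configured default cache is assumed):

from django.core.cache import cache

def total_readers_cached(user, ttl=300):
    key = "readers_%s" % user.username
    value = cache.get(key)
    if value is None:
        value = user.readers()
        cache.set(key, value, timeout=ttl)  # value can be stale for at most ttl seconds
    return value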
You might actually get some mileage out of declaring an m2m with a through relationship, but I have no experience with it.
I have a very standard, basic social application -- with status updates (i.e., posts), and multiple comments per post.
Given the following simplified models, is it possible, using Django's ORM, to efficiently retrieve all posts and the latest two comments associated with each post, without performing N+1 queries? (That is, without performing a separate query to get the latest comments for each post on the page.)
class Post(models.Model):
    title = models.CharField(max_length=255)
    text = models.TextField()

class Comment(models.Model):
    text = models.TextField()
    post = models.ForeignKey(Post, related_name='comments')

    class Meta:
        ordering = ['-pk']
Post.objects.prefetch_related('comments').all() fetches all posts and comments, but I'd like to retrieve a limited number of comments per post only.
UPDATE:
I understand that, if this can be done at all using Django's ORM, it probably must be done with some version of prefetch_related. Multiple queries are totally okay, as long as I avoid making N+1 queries per page.
What is the typical/recommended way of handling this problem in Django?
UPDATE 2:
There seems to be no direct and easy way to do this efficiently with a simple query using the Django ORM. There are a number of helpful solutions/approaches/workarounds in the answers below, including:
Caching the latest comment IDs in the database
Performing a raw SQL query
Retrieving all comment IDs and doing the grouping and "joining" in python
Limiting your application to displaying the latest comment only
I didn't know which one to mark as correct because I haven't had a chance to experiment with all of these methods yet, but I awarded the bounty to hynekcer for presenting a number of options.
UPDATE 3:
I ended up using @user1583799's solution.
If you're using Django 1.7, the new Prefetch objects, which allow you to customize the prefetch queryset, could prove helpful.
Unfortunately I can't think of a simple way to do exactly what you're asking. But if you're on PostgreSQL and are willing to get just the latest comment for each post, the following should work, in two queries:
from django.db.models import Prefetch

comments = Comment.objects.order_by('post_id', '-id').distinct('post_id')
posts = Post.objects.prefetch_related(
    Prefetch('comments', queryset=comments, to_attr='latest_comments'))

for post in posts:
    latest_comment = post.latest_comments[0] if post.latest_comments else None
Another variation: if your comments had a timestamp and you wanted to limit the comments to the most recent ones by date, that would look something like:
comments = Comment.objects.filter(timestamp__gt=one_day_ago)
...and then proceed as above. Of course, you could still post-process the resulting list to limit the display to a maximum of two comments:
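For example (a short sketch; latest_comments is the to_attr from the prefetch above):

for post in posts:
    # the prefetch may have attached more than two recent comments per post
    post.latest_comments = post.latest_comments[:2]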
This solution is optimized for memory requirements, which you indicated is important. It needs three queries: the first asks for posts, the second only for (id, post_id) tuples, and the third for the details of the filtered latest comments.
from itertools import groupby, islice

posts = Post.objects.filter(...)  # whatever filter you need

# (pk, post_id) tuples, grouped by post, newest first
all_comments = (Comment.objects.filter(post__in=posts)
                .values_list('pk', 'post_id')
                .order_by('post_id', '-pk'))

last_comment_ids = []
# the queryset is evaluated now; only chunks of about 100 items are in
# memory at once during iteration
for post_id, related_comments in groupby(all_comments, lambda x: x[1]):
    last_comment_ids.extend(pk for pk, _ in islice(related_comments, 2))

results = {}
for comment in Comment.objects.filter(pk__in=last_comment_ids):
    results.setdefault(comment.post_id, []).append(comment)

# output
for post in posts:
    print(post.title, [x.text for x in results.get(post.id, [])])
But I think that for many database backends it will be faster to combine the second and the third query into one, asking immediately for all fields of the comments; the unneeded comments are discarded right away.
The fastest solution would be nested queries. The algorithm is like the one above, but everything is realized in raw SQL. It is limited to some backends, like PostgreSQL.
EDIT
I agree that this is not useful for you:
... prefetch loads into memory thousands of comments, 99% of which will not be shown.
and that is why I wrote the relatively complicated solution above, where 99% of the comments are read continuously without being loaded into memory.
EDIT
All examples assume the condition that you want post_id in [1, 3, 5] (anything selected earlier by categories etc.).
In all cases, create an index for Comment on the fields ['post', 'pk'].
A) Nested query for PostgreSQL
SELECT post_id, id, text FROM
(SELECT post_id, id, text, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
FROM app_comment WHERE post_id in (1, 3, 5)) sub
WHERE rank <= 2
ORDER BY post_id, id
Or, if we don't trust the optimizer, explicitly require less memory. This version should read data only from the index in the two inner selects, which is much less data than from the table:
SELECT post_id, id, text FROM app_comment WHERE id IN
(SELECT id FROM
(SELECT id, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
FROM app_comment WHERE post_id in (1, 3, 5)) sub
WHERE rank <= 2)
ORDER BY post_id, id
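If you want to run that nested query through the ORM, Comment.objects.raw() can execute it as-is, since the primary key column id is in the select list (a sketch, using the default table name app_comment as in the SQL above):

comments = Comment.objects.raw("""
    SELECT post_id, id, text FROM app_comment WHERE id IN
      (SELECT id FROM
        (SELECT id, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
         FROM app_comment WHERE post_id IN (1, 3, 5)) sub
       WHERE rank <= 2)
    ORDER BY post_id, id""")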
B) With a cached ID of the oldest displayed comment
Add a field oldest_displayed to Post:
class Post(models.Model):
    oldest_displayed = models.IntegerField()
Then filter comments by pk for the interesting posts (those you selected earlier by categories etc.):
import pprint
from django.db.models import F

qs = Comment.objects.filter(
    post__pk__in=[1, 3, 5],
    post__oldest_displayed__lte=F('pk'),
).order_by('post_id', 'pk')
pprint.pprint([(x.post_id, x.pk) for x in qs])
Hmm, very nice ... and how is it compiled by Django?
>>> print(qs.query.get_compiler('default').as_sql()[0]) # added white space
SELECT "app_comment"."id", "app_comment"."text", "app_comment"."post_id"
FROM "app_comment"
INNER JOIN "app_post" ON ( "app_comment"."post_id" = "app_post"."id" )
WHERE ("app_comment"."post_id" IN (%s, %s, %s)
AND "app_post"."oldest_displayed" <= ("app_comment"."id"))
ORDER BY "app_comment"."post_id" ASC, "app_comment"."id" ASC
Prepare all the oldest_displayed values initially with one nested SQL statement (and set zero for posts with fewer than two comments):
UPDATE app_post SET oldest_displayed = 0
UPDATE app_post SET oldest_displayed = qq.id FROM
(SELECT post_id, id FROM
(SELECT post_id, id, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
FROM app_comment ) sub
WHERE rank = 2) qq
WHERE qq.post_id = app_post.id;
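These statements can be issued from Django with a raw cursor (a sketch; prepare_sql is a made-up variable holding the UPDATE ... FROM statement above):

from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("UPDATE app_post SET oldest_displayed = 0")
    cursor.execute(prepare_sql)  # the UPDATE ... FROM statement above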
prefetch_related('comments') will fetch all comments of the posts.
I had the same problem, and the database is PostgreSQL. I found a way:
Add an extra field related_replies. Note that the field type is ArrayField, which is supported in Django 1.8 dev. I copied the code into my project (on Django 1.7), changed just 2 lines, and it works (or use djorm-pg-array).
class Post(models.Model):
    related_replies = ArrayField(models.IntegerField(), size=10, null=True)
And use two queries:
from itertools import chain

posts = models.Post.objects.filter(...)
related_replies_id = list(chain(*[p.related_replies or [] for p in posts]))
related_replies = models.Comment.objects.filter(
    id__in=related_replies_id).select_related('created_by')[::1]  # evaluate and cache the queryset

for p in posts:
    p.get_related_replies = [r for r in related_replies if r.post_id == p.id]
When a new comment comes in, update related_replies.
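A sketch of that maintenance step (the helper name is made up; it prepends the new comment's id and keeps the ten newest, matching size=10):

def register_reply(post, comment):
    ids = post.related_replies or []
    post.related_replies = ([comment.id] + ids)[:10]
    post.save(update_fields=['related_replies'])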
I need to be able to bulk insert large numbers of records quickly, while still ensuring uniqueness in the database. The new records to be inserted have already been parsed and are unique among themselves. I'm hoping there is a way to enforce uniqueness at the database level, and not in the code itself.
I'm using MySQL as the database backend. If Django supports this functionality with any other database, I am flexible about changing the backend, as this is a requirement.
Bulk inserts in Django don't use the save method, so how can I insert several hundred to several thousand records at a time, while still respecting unique fields and unique-together fields?
My model structures, simplified, look something like this:
class Example(models.Model):
    class Meta:
        unique_together = (('name', 'number'),)

    name = models.CharField(max_length=50)
    number = models.CharField(max_length=10)
    ...
    fk = models.ForeignKey(OtherModel)
Edit:
The records that aren't already in the database should be inserted, and the records that already exist should be ignored.
As miki725 mentioned, you don't have a problem with your current code.
I'm assuming you are using the bulk_create method. It is true that save() is not called when using bulk_create, but the uniqueness of fields is not enforced inside save() anyway. When you use unique_together, a unique constraint is added to the underlying table in MySQL when the table is created:
Django:
unique_together = (('name', 'number'),)
MySQL:
UNIQUE KEY `name` (`name`,`number`)
So if you insert a value into the table using any method (save, bulk_insert or even raw sql) you will get this exception from mysql:
Duplicate entry 'value1-value2' for key 'name'
UPDATE:
What bulk_create does is build one big query that inserts all the data at once. So if one of the entries is a duplicate, it throws an exception and none of the data is inserted.
1- One option is to use the batch_size parameter of bulk_create to insert the data in a number of batches, so that if one of them fails you only lose the rest of the data in that batch. (Whether this is acceptable depends on how important it is to insert all the data and how frequent the duplicate entries are.)
2- Another option is to write a for loop over the bulk data and insert the records one by one, as sketched after this list. This way the exception is thrown for that one row only and the rest of the data is inserted. This hits the database for every row and is of course a lot slower.
3- A third option is to lift the unique constraint, insert the data using bulk_create, and then write a simple query that deletes the duplicate rows.
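A minimal sketch of option 2, assuming examples holds the parsed, unsaved Example instances. (On Django 2.2+, bulk_create(objs, ignore_conflicts=True) achieves the same insert-new-skip-duplicates behavior in a single query.)

from django.db import IntegrityError, transaction

inserted, skipped = 0, 0
for obj in examples:
    try:
        # a savepoint per row keeps any surrounding transaction usable after a failure
        with transaction.atomic():
            obj.save()
        inserted += 1
    except IntegrityError:
        skipped += 1  # row already exists; ignore it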
Django itself does not enforce the unique_together meta attribute; it is enforced by the database using a UNIQUE constraint. You can insert as much data as you want and you are guaranteed that the specified fields will be unique. If not, an IntegrityError will be raised. More about unique_together in the docs.