Consider a simple ForeignKey relationship:
class A(Model):
    pass

class B(Model):
    a = ForeignKey(A)
I have an API view that creates an A and a set of B's based on outside data (data NOT passed from the user), then serializes the created objects and returns the serialized data. My object creation code looks something like:
a = A()
a.b_set.bulk_create(B(a=a) for b in [...])
My issue is that this does not add the B objects to a's b_set, so that if I were to run
print(a.b_set.all())
afterwards, it would re-query the DB to get b_set. This is unnecessary though, because I already have a's entire b_set as I just created it. I'm doing this with a series of nested objects so it results in a LOT of unnecessary queries. My current workaround is to, after creation, run a query like
A.objects.prefetch_related('b_set').get(id=a.id)
then serialize that fetched object. This limits serialization to just one unnecessary query, but I'd like to eliminate that one as well. It seems to me like there should be a way to cache the created B objects on a, and eliminate
any need to hit the DB again during serialization.
I believe you need to first execute a.save() before you can bulk_create. Here are my results using the two models you described:
a = A()
a.save()
a.b_set.bulk_create([B(a=a), B(a=a), B(a=a)])
a.b_set.count()
>>> 3
After some investigation into the QuerySet and Model source code, I decided my best/only option was to directly modify _prefetched_objects_cache on each object. Definitely not pretty but it works. Here's the gist of what I did:
a = A()
a.save()
b_set = a.b_set.bulk_create(B(a=a) for b in [...])
a._prefetched_objects_cache = {}
a._prefetched_objects_cache['b_set'] = b_set
This ensures that all the created B's are cached on a. Note that if B has an auto-created primary key field, those fields won't be populated in the objects returned by bulk_create with most backends. Fortunately I'm using PostgreSQL which returns auto-PKs from bulk_create so that's not a problem for me.
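For completeness, here is a rough sketch of the full flow, including the save() the other answer points out is needed. It relies on the private _prefetched_objects_cache attribute behaving as described above, so treat it as an internal detail that may differ between Django versions:
a = A()
a.save()  # bulk_create needs a saved parent so the FK value exists
created_bs = a.b_set.bulk_create([B(a=a) for _ in range(3)])
a._prefetched_objects_cache = {'b_set': created_bs}
print(a.b_set.all())  # served from the cache, no extra query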
In the Spring framework when an object A has a ManyToOne relation to object B:
public class objectA {
    private Long id;

    @ManyToOne
    private objectB objectB;
}
Then that objectB is only fetched when the getter method for objectB is called somewhere in the code. In that case an extra SQL query is made to fetch the objectB of that particular objectA instance. You can change that behaviour by changing the fetch strategy to EAGER; in that case objectB is also fetched during the initial query that fetches objectA.
My question is: what is the default fetch strategy that Django uses? Does Django always fetch referenced parent or child objects eagerly? Or does it differ per relation?
Thank you
Many-to-one in Django is defined by the ForeignKey field.
By default the related object is fetched lazily, meaning only when you access that field.
The equivalent of EAGER is .select_related('objectB'). Docs here
select_related()
Returns a QuerySet that will “follow” foreign-key relationships, selecting additional related-object data when it executes its query. This is a performance booster which results in a single more complex query but means later use of foreign-key relationships won’t require database queries.
The following examples illustrate the difference between plain lookups and select_related() lookups. Here’s standard lookup:
# Hits the database.
e = Entry.objects.get(id=5)
# Hits the database again to get the related Blog object.
b = e.blog
And here’s select_related lookup:
# Hits the database.
e = Entry.objects.select_related('blog').get(id=5)
# Doesn't hit the database, because e.blog has been prepopulated
# in the previous query.
b = e.blog
I'm trying to find an optimal way to execute a query, but got myself confused with the prefetch_related and select_related use cases.
I have a 3-table foreign key relationship: A has one-to-many B, which has one-to-many C.
class A(models.Model):
    ...

class B(models.Model):
    a = models.ForeignKey(A)

class C(models.Model):
    b = models.ForeignKey(B)
    data = models.TextField(max_length=50)
I'm trying to get a list of all C.data for all instances of A that match a criteria (an instance of A and all its children), so I have something like this:
qs1 = A.objects.filter(Q(id=12345) | Q(parent_id=12345))
qs2 = C.objects.select_related('b__a').filter(b__a__in=qs1)
But I'm wary of the prefetch_related docs stating that:
any subsequent chained methods which imply a different database query
will ignore previously cached results, and retrieve data using a fresh
database query
I don't know if that applies here (because I'm using select_related), but reading it makes it seem as if anything gained from doing select_related is lost as soon as I do the filter.
Is my two-part query as optimal as it can be? I don't think I need prefetch as far as I'm aware, although I noticed I can swap out select_related with prefetch_related and get the same result.
I think your question is driven by a misconception. select_related (and prefetch_related) are an optimisation, specifically for returning values in related models along with the original query. They are never required.
What's more, neither has any impact at all on filter. Django will automatically do the relevant joins and subqueries in order to make your query, whether or not you use select_related.
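To make that concrete, here is a quick sketch using the models above (an illustration, not the OP's code): the filter builds the same join either way, and select_related only changes whether later attribute access needs extra queries.
qs_plain = C.objects.filter(b__a__in=qs1)  # same rows as the eager version below
qs_eager = C.objects.select_related('b__a').filter(b__a__in=qs1)
for c in qs_plain:
    c.b.a  # lazy: hits the database for each c
for c in qs_eager:
    c.b.a  # already loaded by the join, no extra queries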
I've been trying to create entities with a property which should be unique or None, something similar to:
class Thing(ndb.Model):
    something = ndb.StringProperty()
    unique_value = ndb.StringProperty()
Since ndb has no way to specify that a property should be unique it is only natural that I do this manually like this:
def validate_unique(the_thing):
    if the_thing.unique_value and Thing.query(Thing.unique_value == the_thing.unique_value).get():
        raise NotUniqueException
This works like a charm until I want to do this in an ndb transaction which I use for creating/updating entities. Like:
@ndb.transactional
def create(the_thing):
    validate_unique(the_thing)
    the_thing.put()
However, inside a transaction ndb only allows ancestor queries, and the problem is that my model does not have an ancestor/parent. I could do the following to prevent this error from popping up:
@ndb.non_transactional
def validate_unique(the_thing):
    ...
This feels a bit out of place, declaring something to be a transaction and then having one (important) part being done outside of the transaction. I'd like to know if this is the way to go or if there is a (better) alternative.
Also, some explanation as to why ndb only allows ancestor queries inside transactions would be nice.
Since your uniqueness check involves a (global) query, it's subject to the datastore's eventual consistency, which means it won't work reliably: the query might not see freshly created entities.
One option would be to switch to an ancestor query, if your expected usage allows you to use such data architecture, (or some other strongly consistent method) - more details in the same article.
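If that data architecture works for you, a rough sketch of this option could look like the following (the synthetic ThingRoot parent is my own assumption for illustration; note that it puts all Thing entities in a single entity group, which limits write throughput):
ROOT_KEY = ndb.Key('ThingRoot', 'root')  # hypothetical common ancestor for all Thing entities

def validate_unique(the_thing):
    # ancestor queries are strongly consistent and allowed inside transactions
    if the_thing.unique_value and Thing.query(
            Thing.unique_value == the_thing.unique_value,
            ancestor=ROOT_KEY).get():
        raise NotUniqueException

@ndb.transactional
def create(the_thing):
    # the_thing must have been constructed with parent=ROOT_KEY
    validate_unique(the_thing)
    the_thing.put()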
Another option is to use an additional piece of data as a temporary cache, in which you'd store a list of all newly created entities for "a while" (giving them ample time to become visible in the global query) which you'd check in validate_unique() in addition to those from the query result. This would allow you to make the query outside the transaction and only enter the transaction if uniqueness is still possible, but the ultimate result is the manual check of the cache, inside the transaction (i.e. no query inside the transaction).
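One hypothetical shape for that temporary cache, using memcache as the store (the reserve_unique name and the grace period are made up for illustration, and memcache eviction makes this best-effort only):
GRACE_SECONDS = 60  # assumed window for a new entity to become visible to the global query

def reserve_unique(value):
    # memcache.add is atomic: it returns False if the key already exists
    return memcache.add('unique:%s' % value, True, time=GRACE_SECONDS)

def validate_unique(the_thing):
    # call this before entering the transaction (global queries aren't allowed inside it)
    if not the_thing.unique_value:
        return
    if Thing.query(Thing.unique_value == the_thing.unique_value).get():
        raise NotUniqueException
    if not reserve_unique(the_thing.unique_value):
        raise NotUniqueException  # something else claimed it within the grace window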
A 3rd option exists (with some extra storage consumption as the price), based on the datastore's enforcement of unique entity IDs for a certain entity model with the same parent (or no parent at all). You could have a model like this:
class Unique(ndb.Model):  # will use the unique values as the entity IDs!
    something = ndb.BooleanProperty(default=False)
which you'd use like this (the example uses a Unique parent key, which allows re-using the model for multiple properties with unique values, you can drop the parent altogether if you don't need it):
@ndb.transactional
def create(the_thing):
    if the_thing.unique_value:
        parent_key = get_unique_parent_key()
        exists = Unique.get_by_id(the_thing.unique_value, parent=parent_key)
        if exists:
            raise NotUniqueException
        Unique(id=the_thing.unique_value, parent=parent_key).put()
    the_thing.put()
def get_unique_parent_key():
    parent_id = 'the_thing_unique_value'
    parent_key = memcache.get(parent_id)
    if not parent_key:
        parent = Unique.get_by_id(parent_id)
        if not parent:
            parent = Unique(id=parent_id)
            parent.put()
        parent_key = parent.key
        memcache.set(parent_id, parent_key)
    return parent_key
I have 2 models with a foreign key to the same model.
class Foo(models.Model):
    too = models.ForeignKey('Too')

class Coo(models.Model):
    too = models.ForeignKey('Too')

class Too(models.Model):
    blah = models.CharField(max_length=255)
In the views I am seeing some very strange behavior. I think I am doing something very wrong.
I want to replicate an object of Too and then save it, e.g.:
too = Too(blah="Awesome")
too.save()
foo = Foo(too=too)
foo.save()
too.id = None  # copy the original
too.save()
coo = Coo(too=too)
coo.save()
print foo.too.id
print coo.too.id
# the two print statements above give the same id
When I check in the admin, foo and coo each have a different Too object saved, but the print statements show the same id. Why is that happening? I think I am doing something fundamentally wrong.
Django looks at the primary key to determine uniqueness, so work with that directly:
too.pk = None
too.save()
Setting the primary key to None will cause Django to perform an INSERT, saving a new instance of the model, rather than an UPDATE to the existing instance.
Source: https://stackoverflow.com/a/4736172/1533388
UPDATE: pk and id are interchangeable in this case, so you'll get the same result either way; my first answer didn't address your question.
The discrepancy here is between what is occurring in python vs. what can be reconstituted from the database.
Your code causes Django to save two unique objects to the database, but you're only working with one python Too instance. When foo.save() occurs, the database entry for 'foo' is created with a reference to the DB entry for the first Too object. When coo.save() occurs, the database entry for 'coo' is created, pointing to the second, unique Too object that was stored via:
too.id = None #Copy the original
too.save()
However, in python, both coo and foo refer to the same object, named 'too', via their respective '.too' attributes. In python, there is only one 'Too' instance. So when you update too.id, you're updating one object, referred to by both coo and foo.
Only when the models are reconstituted from the database (as the admin view does in order to display them) are unique instances created for each foreign key; this is why the admin view shows two unique saved instances.
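A minimal way to sidestep the confusion (my own sketch, not part of the question) is to keep two distinct Python instances, e.g. by reloading the row before resetting its primary key:
original_too = Too.objects.create(blah="Awesome")
foo = Foo.objects.create(too=original_too)
copied_too = Too.objects.get(pk=original_too.pk)  # a separate Python instance
copied_too.pk = None
copied_too.save()  # INSERTs a new row
coo = Coo.objects.create(too=copied_too)
print foo.too.id  # id of the original Too
print coo.too.id  # different id for the copy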
I've always found the Django orm's handling of subclassing models to be pretty spiffy. That's probably why I run into problems like this one.
Take three models:
class A(models.Model):
    field1 = models.CharField(max_length=255)

class B(A):
    fk_field = models.ForeignKey('C')

class C(models.Model):
    field2 = models.CharField(max_length=255)
So now you can query the A model and get all the B models, where available:
the_as = A.objects.all()
for a in the_as:
    print a.b.fk_field.field2  # note that this raises an error if there is no B record
The problem with this is that you are looking at a huge number of database calls to retrieve all of the data.
Now suppose you wanted to retrieve a QuerySet of all A models in the database, but with all of the subclass records and the subclass's foreign key records as well, using select_related() to limit your app to a single database call. You would write a query like this:
the_as = A.objects.select_related("b", "b__fk_field").all()
One query returns all of the data needed! Awesome.
Except it isn't, because this version of the query does its own filtering, even though select_related is not supposed to filter any results at all:
set_1 = A.objects.select_related("b", "b__fk_field").all()  # only returns A objects that have an associated B
set_2 = A.objects.all()  # returns all A objects
len(set_1) < len(set_2)  # True whenever some A has no B: set_1 has been silently filtered
I used the django-debug-toolbar to inspect the query and found the problem. The generated SQL query uses an INNER JOIN to join the C table to the query, instead of a LEFT OUTER JOIN like other subclassed fields:
SELECT "app_a"."field1", "app_b"."fk_field_id", "app_c"."field2"
FROM "app_a"
LEFT OUTER JOIN "app_b" ON ("app_a"."id" = "app_b"."a_ptr_id")
INNER JOIN "app_c" ON ("app_b"."fk_field_id" = "app_c"."id");
And it seems if I simply change the INNER JOIN to LEFT OUTER JOIN, then I get the records that I want, but that doesn't help me when using Django's ORM.
Is this a bug in select_related() in Django's ORM? Is there any work around for this, or am I simply going to have to do a direct query of the database and map the results myself? Should I be using something like Django-Polymorphic to do this?
It looks like a bug; specifically, it seems to be ignoring the nullable nature of the A->B relationship. If, for example, A had a foreign key reference to B instead of the subclassing, that foreign key would of course be nullable and Django would use a LEFT JOIN for it. You should probably raise this in the Django issue tracker. You could also try using prefetch_related instead of select_related, which might get around your issue.
I found a workaround for this, but I will wait a while to accept it in the hope that I can get some better answers.
The INNER JOIN created by the select_related('b__fk_field') needs to be removed from the underlying SQL so that the results aren't filtered by the B records in the database. So the new query needs to leave the b__fk_field parameter in select_related out:
the_as = A.objects.select_related('b')
However, this forces us to hit the database every time a C object is accessed from an A object.
for a in the_as:
    # Note that this raises a DoesNotExist error if a doesn't have an associated b
    print a.b.fk_field.field2  # hits the database every time
The hack to work around this is to fetch all of the C objects we need from the database in one query and then attach them to the B objects manually. We can do this because each retrieved B object already carries the fk_field_id that references its associated C object:
c_ids = [a.b.fk_field_id for a in the_as]  # get all the C ids (a.b raises DoesNotExist if a has no B)
the_cs = C.objects.filter(pk__in=c_ids)  # one query to get all of the needed C records
for c in the_cs:
    for a in the_as:
        if a.b.fk_field_id == c.pk:
            a.b.fk_field = c
            break
I'm sure there's a functional way to write that without the nested loop, but this illustrates what's happening. It's not ideal, but it provides all of the data with the absolute minimum number of database hits - which is what I wanted.
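For what it's worth, one way to drop the nested loop (my own suggestion, not part of the original workaround) is to build a pk-to-object mapping with in_bulk:
cs_by_pk = C.objects.in_bulk(c_ids)  # one query; returns {pk: C instance}
for a in the_as:
    a.b.fk_field = cs_by_pk[a.b.fk_field_id]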