ndb verify entity uniqueness in transaction - python-2.7

I've been trying to create entities with a property which should be unique or None, something similar to:
class Thing(ndb.Model):
    something = ndb.StringProperty()
    unique_value = ndb.StringProperty()
Since ndb has no way to specify that a property should be unique, it is only natural that I do this manually, like this:
def validate_unique(the_thing):
    if the_thing.unique_value and Thing.query(Thing.unique_value == the_thing.unique_value).get():
        raise NotUniqueException
This works like a charm until I want to do it in an ndb transaction, which I use for creating/updating entities. Like:
@ndb.transactional
def create(the_thing):
    validate_unique(the_thing)
    the_thing.put()
However, ndb seems to only allow ancestor queries inside a transaction, and the problem is that my model does not have an ancestor/parent. I could do the following to prevent this error from popping up:
@ndb.non_transactional
def validate_unique(the_thing):
    ...
This feels a bit out of place: declaring something to be a transaction and then having one (important) part of it done outside of the transaction. I'd like to know if this is the way to go or if there is a (better) alternative.
Also some explanation as to why ndb only allows ancestor queries would be nice.

Since your uniqueness check involves a (global) query, it is subject to the datastore's eventual consistency: the query might not detect freshly created entities, so the check won't work reliably.
One option would be to switch to an ancestor query (or some other strongly consistent method), if your expected usage allows such a data architecture - more details in the same article.
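For illustration, here's a minimal sketch of that ancestor-query variant, assuming you're willing to put all Thing entities under a single hypothetical shared parent key (which makes the query strongly consistent and allowed inside the transaction, at the cost of limiting the entity group to roughly one write per second):

ROOT_KEY = ndb.Key('ThingRoot', 'root')  # hypothetical shared parent

def validate_unique(the_thing):
    if the_thing.unique_value and Thing.query(
            Thing.unique_value == the_thing.unique_value,
            ancestor=ROOT_KEY).get():
        raise NotUniqueException

@ndb.transactional
def create(the_thing):
    validate_unique(the_thing)  # ancestor query, so it's allowed in the transaction
    the_thing.put()             # the_thing must be constructed with parent=ROOT_KEY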
Another option is to use an additional piece of data as a temporary cache, in which you'd store a list of all newly created entities for "a while" (giving them ample time to become visible in the global query), and which you'd check in validate_unique() in addition to the query result. This would allow you to make the query outside the transaction and only enter the transaction if uniqueness is still possible; the ultimate check is then the manual check of the cache, inside the transaction (i.e. no query inside the transaction).
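A rough sketch of that idea, with memcache standing in for the temporary cache (the key name and the 60-second window are assumptions; note that the memcache read-then-write is itself not atomic, so this narrows the race window rather than fully closing it):

from google.appengine.api import memcache

RECENT_KEY = 'recent_unique_values'  # hypothetical cache key

def create_if_unique(the_thing):
    value = the_thing.unique_value
    # global (eventually consistent) query, done outside the transaction
    if value and Thing.query(Thing.unique_value == value).get():
        raise NotUniqueException
    create(the_thing)

@ndb.transactional
def create(the_thing):
    value = the_thing.unique_value
    if value:
        recent = memcache.get(RECENT_KEY) or []
        if value in recent:
            raise NotUniqueException
        # keep the value around long enough to become visible in global queries
        memcache.set(RECENT_KEY, recent + [value], time=60)
    the_thing.put()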
A third option exists (with some extra storage consumption as the price), based on the datastore's enforcement of unique entity IDs for a given entity model with the same parent (or no parent at all). You could have a model like this:
class Unique(ndb.Model):  # will use the unique values as the entity IDs!
    something = ndb.BooleanProperty(default=False)
which you'd use like this (the example uses a Unique parent key, which allows re-using the model for multiple properties with unique values; you can drop the parent altogether if you don't need it):
from google.appengine.api import memcache
from google.appengine.ext import ndb

# xg=True because the Unique entities and the_thing live in different entity groups
@ndb.transactional(xg=True)
def create(the_thing):
    if the_thing.unique_value:
        parent_key = get_unique_parent_key()
        exists = Unique.get_by_id(the_thing.unique_value, parent=parent_key)
        if exists:
            raise NotUniqueException
        Unique(id=the_thing.unique_value, parent=parent_key).put()
    the_thing.put()

def get_unique_parent_key():
    parent_id = 'the_thing_unique_value'
    parent_key = memcache.get(parent_id)
    if not parent_key:
        parent = Unique.get_by_id(parent_id)
        if not parent:
            parent = Unique(id=parent_id)
            parent.put()
        parent_key = parent.key
        memcache.set(parent_id, parent_key)
    return parent_key
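A hypothetical caller would then look like this (NotUniqueException is assumed to be an exception class you define yourself, as in the question):

try:
    create(Thing(something='hello', unique_value='handle42'))
except NotUniqueException:
    pass  # e.g. report "this value is already taken" to the user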

Related

Return object when aggregating grouped fields in Django

Assuming the following example model:
# models.py
class event(models.Model):
    location = models.CharField(max_length=10)
    type = models.CharField(max_length=10)
    date = models.DateTimeField()
    attendance = models.IntegerField()
I want to get the attendance number for the latest date of each event location and type combination, using Django ORM. According to the Django Aggregation documentation, we can achieve something close to this, using values preceding the annotation.
... the original results are grouped according to the unique combinations of the fields specified in the values() clause. An annotation is then provided for each unique group; the annotation is computed over all members of the group.
So using the example model, we can write:
event.objects.values('location', 'type').annotate(latest_date=Max('date'))
which does indeed group events by location and type, but does not return the attendance field, which is the desired behavior.
Another approach I tried was to use distinct i.e.:
event.objects.distinct('location', 'type').annotate(latest_date=Max('date'))
but I get an error
NotImplementedError: annotate() + distinct(fields) is not implemented.
I found some answers which rely on database specific features of Django, but I would like to find a solution which is agnostic to the underlying relational database.
Alright, I think this one might actually work for you. It is based upon an assumption which I think is correct:
when you create your model objects, they should all be unique. It seems highly unlikely that you would have two events on the same date, in the same location, of the same type. So with that assumption, let's begin. (As a formatting note, class names tend to start with capital letters, to differentiate classes from variables or instances.)
# First you get your desired events with your criteria.
results = Event.objects.values('location', 'type').annotate(latest_date=Max('date'))
# Make an empty list to store the values you want.
results_list = []
# Then iterate through your 'results', looking up the objects
# you want and populating the list.
for r in results:
    result = Event.objects.get(location=r['location'], type=r['type'], date=r['latest_date'])
    results_list.append(result)
# Now you have a list of objects that you can do whatever you want with.
You might have to look up the exact output of Max('date'), but this should get you on the right path.
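If the per-group lookups bother you (one extra query per group), here is a sketch, under the same uniqueness assumption, that collapses the loop into a single follow-up query by OR-ing the group keys together with Q objects; it stays database-agnostic:

from django.db.models import Max, Q

results = Event.objects.values('location', 'type').annotate(latest_date=Max('date'))

condition = Q()
for r in results:
    condition |= Q(location=r['location'], type=r['type'], date=r['latest_date'])

# one query fetching the full row for each (location, type) group's latest event
results_list = list(Event.objects.filter(condition)) if condition else []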

Django queryset behind the scenes

Difference between creating a foreign key for consistency and for joins
I am fine using the ForeignKey and QuerySet APIs with Django.
I just want to understand a little more deeply how it works behind the scenes.
In the Django manual, it says:
a database index is automatically created on the ForeignKey. You can disable this by setting db_index to False. You may want to avoid the overhead of an index if you are creating a foreign key for consistency rather than joins, or if you will be creating an alternative index like a partial or multiple column index.
creating a foreign key for consistency rather than joins
This part is confusing me.
I expected the JOIN keyword to be used when you query via a foreign key, like below:
SELECT *
FROM vehicles
INNER JOIN users ON vehicles.car_owner = users.user_id
For example,
class Place(models.Model):
    name = models.CharField(max_length=50)
    address = models.CharField(max_length=50)

class Comment(models.Model):
    place = models.ForeignKey(Place)
    content = models.CharField(max_length=50)
if you use a queryset like Comment.objects.filter(place=1), I expected the low-level SQL command to use the JOIN keyword.
But when I checked by printing queryset.query in the console, it showed the following.
(I simplified the model just to explain; the output below shows all the attributes of my model, which you can ignore.)
SELECT "bfm_comment"."id", "bfm_comment"."content", "bfm_comment"."user_id", "bfm_comment"."place_id", "bfm_comment"."created_at"
FROM "bfm_comment"
WHERE "bfm_comment"."place_id" = 1
creating a foreign key for consistency vs creating a foreign key for joins
Simply put, I thought that using any queryset meant using the foreign key for joins, because you can easily get the parent table's data with c = Comment.objects.get(id=1); c.place.name. I thought it joined the two tables behind the scenes, but the result of print(queryset.query) showed no JOIN keyword; the row was found with the WHERE keyword instead.
Here is how I understood it, from an answer:
Case 1:
Comment.objects.filter(place=1)
result
SELECT "bfm_comment"."id", "bfm_comment"."content", "bfm_comment"."user_id", "bfm_comment"."place_id", "bfm_comment"."created_at"
FROM "bfm_comment"
WHERE "bfm_comment"."place_id" = 1
Case 2:
Comment.objects.filter(place__name="df")
result
SELECT "bfm_comment"."id", "bfm_comment"."content", "bfm_comment"."user_id", "bfm_comment"."place_id", "bfm_comment"."created_at"
FROM "bfm_comment" INNER JOIN "bfm_place" ON ("bfm_comment"."place_id" = "bfm_place"."id")
WHERE "bfm_place"."name" = df
Case 1 searches only the Comment table, for rows whose place_id column is 1.
But in Case 2, the query needs to know the Place table's attribute name, so it has to use the JOIN keyword to check values in the columns of the Place table. Right?
So is it alright to think that I create a foreign key for joins if I use a queryset like Case 2, and that in that case it is better to create an index on the ForeignKey?
For the above question, I think I can take the answer from the Django manual:
Consider adding indexes to fields that you frequently query using filter(), exclude(), order_by(), etc. as indexes may help to speed up lookups. Note that determining the best indexes is a complex database-dependent topic that will depend on your particular application. The overhead of maintaining an index may outweigh any gains in query speed.
In conclusion, it really depends on how my application works with it.
If you execute the following command, the mystery will be revealed:
./manage.py sqlmigrate myapp 0001
Take care to replace myapp with your app name (bfm I think) and 0001 with the actual migration where the Comment model is created.
The generated SQL will reveal that the actual table is created with a place_id integer column rather than a place Place column. That is because the RDBMS doesn't know anything about models; the models exist only at the application level. It's the job of the Django ORM to fetch the data from the RDBMS and convert it into model instances. That's why you always get a place member on each of your Comment instances, and that place member in turn gives you access to the members of the related Place instance.
So what happens when you do the following?
Comment.objects.filter(place=1)
Django is smart enough to know that you are referring to a place_id, because 1 is obviously not an instance of Place. But if you used a Place instance the result would be the same. So there is no join here. The above query would definitely benefit from an index on place_id, but it wouldn't benefit from a foreign key constraint! Only the Comment table is queried.
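You can verify this yourself in a Django shell; a hypothetical check:

place = Place.objects.get(pk=1)
print(Comment.objects.filter(place=place).query)
print(Comment.objects.filter(place=1).query)
# both print the same SQL: SELECT ... FROM "bfm_comment" WHERE "bfm_comment"."place_id" = 1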
If you want a join, try this:
Comment.objects.filter(place__name='my home')
Queries of this nature, with the __ lookup, often result in joins, but sometimes they result in a subquery.
Querysets are lazy.
https://docs.djangoproject.com/en/1.10/topics/db/queries/#querysets-are-lazy
QuerySets are lazy – the act of creating a QuerySet doesn’t involve
any database activity. You can stack filters together all day long,
and Django won’t actually run the query until the QuerySet is
evaluated. Take a look at this example:
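The docs' own example isn't reproduced above, but here is a minimal sketch of the same idea using the question's models:

qs = Comment.objects.filter(place__name='my home')   # no database activity yet
qs = qs.filter(content__icontains='hello')           # still nothing sent to the database
comments = list(qs)                                  # the SQL query runs here, once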

How to create an auto increment field for a django model, with two keys which are unique_together?

I have a model like:
class Item(models.Model):
    site = models.ForeignKey(Site)
    id_on_site = models.PositiveIntegerField()
Now I want to create an instance Item(current_site, next_id_on_site) with:
next_id_on_site = Item.objects.filter(site=current_site).aggregate(current_id=Max('id_on_site'))['current_id'] + 1
The problem is that the operation of generating the ID and creating the Item is not atomic, so there is a race condition which creates duplicate IDs, and .get(site=current_site, id_on_site=someid) will then raise a MultipleObjectsReturned exception.
Using unique_together in the model does not help with the generation of the auto increment ID and doesn't seem to be implemented at the DB level at all.
unique_together is definitely implemented in the database, but since it generates a unique database index, you may need to run a migration to see its effects.
If all you want is for id_on_site to be some unique identifier for the item at the site, it might be easier to use something like a UUIDField(default=uuid.uuid4), which has a near-certain guarantee of uniqueness. If you need the ID to be an auto-incrementing integer, that's a bit harder.
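A minimal sketch of the UUID variant (assuming the Site foreign key from the question):

import uuid
from django.db import models

class Item(models.Model):
    site = models.ForeignKey(Site)
    # practically collision-free, and needs no coordination between writers
    id_on_site = models.UUIDField(default=uuid.uuid4, editable=False)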
One option to avoid the race condition would be to lock the row with the highest id_on_site value. (This will only work on certain database backends, e.g. postgres):
from django.db.transaction import atomic

with atomic():
    # +1 so the new item gets the next ID (the locked row already holds the max)
    next_id_on_site = (
        Item.objects
        .filter(site=current_site)
        .select_for_update(nowait=False)
        .latest('id_on_site')
        .id_on_site) + 1
    Item.objects.create(site=current_site, id_on_site=next_id_on_site)
This should cause other transactions to block if they're trying to get the highest id_on_site, and once the transaction is committed, the item you just inserted will be visible to the other transaction. This could be problematic if the transaction is long-lived for some reason.
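One edge case the snippet above glosses over: latest() raises Item.DoesNotExist when the site has no items yet, so a first-item fallback is needed. A hypothetical wrapper:

from django.db import transaction

def create_item(current_site):
    with transaction.atomic():
        try:
            latest = (Item.objects
                      .filter(site=current_site)
                      .select_for_update()
                      .latest('id_on_site'))
            next_id = latest.id_on_site + 1
        except Item.DoesNotExist:
            next_id = 1  # first item for this site
        return Item.objects.create(site=current_site, id_on_site=next_id)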

Django filter on two fields of the same foreign key object

I have a database schema similar to this:
class User(models.Model):
    ...  # (some fields irrelevant for this query)

class UserNotifiy(models.Model):
    user = models.ForeignKey(User)
    target = models.ForeignKey(<Some other Model>)
    notification_level = models.PositiveSmallIntegerField(choices=((1, 1), (2, 2), (3, 3)))
Now I want to query for all Users that have a UserNotify object for a specific target and at least a specific notification level (e.g. 2).
If I do something like this:
User.objects.filter(usernotify__target=desired_target,
                    usernotify__notification_level__gte=2)
I get all Users that have a UserNotify object for the specified target and at least one UserNotify object with a notification_level greater or equal to 2. These two UserNotify objects, however, do not have to be identical.
I am aware that I can do something like this:
user_ids = UserNotify.objects.filter(target=desired_target,
                                     notification_level__gte=2).values_list('user_id', flat=True)
users = User.objects.filter(id__in=user_ids).distinct()
But this seems like one step too many to me, and I believe it executes two queries.
Is there a way to solve my problem with a single query?
Actually I don't see how you can run the first query, given that usernotify is not a valid field name for User.
You should start from UserNotify as you did in your second example:
UserNotify.objects.filter(
    target=desired_target,
    notification_level__gte=2
).select_related('user').values('user').distinct()
I've been looking for this behaviour but I've never found a better way than the one you describe (creating a query for user ids and injecting it into a User query). Note that this is not bad: if your database supports subqueries, your code should fire only one request, composed of a query and a subquery.
However, if you just need a particular field from the User objects (for example first_name), you may try:
qs = (UserNotify.objects
      .filter(target=desired_target, notification_level__gte=2)
      .values_list('user_id', 'user__first_name')
      .order_by('user_id')
      .distinct('user_id'))
I am not sure if I understood your question, but:
class User(models.Model):
    ...  # (some fields irrelevant for this query)

class UserNotifiy(models.Model):
    user = models.ForeignKey(User, related_name='notifications')
    target = models.ForeignKey(<Some other Model>)
    notification_level = models.PositiveSmallIntegerField(choices=((1, 1), (2, 2), (3, 3)))
Then
users = (User.objects
         .prefetch_related('notifications')  # select_related() cannot follow a reverse foreign key
         .filter(notifications__target=desired_target,
                 notifications__notification_level__gte=2)
         .distinct('id'))
for user in users:
    notifications = [x for x in user.notifications.all()]
I don't have my vagrant box handy now, but I believe this should work.

Searching a many to many database using Google Cloud Datastore

I am quite new to Google App Engine. I know the Google datastore is not SQL, but I am trying to get many-to-many relationship behaviour in it. As you can see below, I have Gif entities and Tag entities. I want my application to search Gif entities by related tag. Here is what I have done:
class Gif(ndb.Model):
    author = ndb.UserProperty()
    link = ndb.StringProperty(indexed=False)

class Tag(ndb.Model):
    name = ndb.StringProperty()

class TagGifPair(ndb.Model):
    tag_id = ndb.IntegerProperty()
    gif_id = ndb.IntegerProperty()

    @classmethod
    def search_gif_by_tag(cls, tag_name):
        query = cls.query(name=tag_name)
        # I am stuck here ...
Is this a correct start to do this? If so, how can I finish it. If not, how to do it?
You can use repeated properties: https://developers.google.com/appengine/docs/python/ndb/properties#repeated. The sample in the link uses tags on an entity as its example, but for your exact use case it will look like:
class Gif(ndb.Model):
    author = ndb.UserProperty()
    link = ndb.StringProperty(indexed=False)
    # you store an array of tag keys here; you could also just make this
    # a StringProperty(repeated=True)
    tag = ndb.KeyProperty(repeated=True)

    @classmethod
    def get_by_tag(cls, tag_name):
        # a query on a repeated property works the same as if it were a single value
        return cls.query(cls.tag == ndb.Key(Tag, tag_name)).fetch()

# we will put the tag_name as its key.id()
# you only really need this if you want to keep records of your tags;
# you could simply keep the tags as strings too
class Tag(ndb.Model):
    gif_count = ndb.IntegerProperty(indexed=False)
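Usage would then look something like this (the link and tag names are made up):

from google.appengine.api import users

gif = Gif(author=users.get_current_user(),
          link='http://example.com/a.gif',
          tag=[ndb.Key(Tag, 'cats'), ndb.Key(Tag, 'funny')])
gif.put()

cat_gifs = Gif.get_by_tag('cats')  # all gifs whose tag list contains the 'cats' key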
Maybe you want to use a list? I would do something like this if you only need to search gifs by tag. I'm using db since I'm not familiar with ndb.
class Gif(db.Model):
    author = db.UserProperty()
    link = db.StringProperty(indexed=False)
    tags = db.StringListProperty(indexed=True)
Query like this:
Gif.all().filter('tags =', tag).fetch(1000)
There are different ways of doing many-to-many relationships. Using ListProperties is one way. The limitation to keep in mind with ListProperties is that there's a limit to the number of indexes per entity, and a limit to the total entity size. This means there's a limit to the number of entries in the list (depending on whether you hit the index count or the entity size limit first). See the bottom of this page: https://developers.google.com/appengine/docs/python/datastore/overview
If you believe the number of references will stay within this limit, this is a good way to go; considering that a Gif is unlikely to carry thousands of tags, this is probably the right way.
The other way is to have an intermediate entity that has reference properties to both sides of your many-to-many relationship. This method lets you scale much higher, but because of all the extra entity writes and reads, it is much more expensive.
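For completeness, a sketch of that intermediate-entity approach, reworking the question's TagGifPair to use keys (the function name is an assumption):

class TagGifPair(ndb.Model):
    tag = ndb.KeyProperty(kind='Tag')
    gif = ndb.KeyProperty(kind='Gif')

def search_gifs_by_tag(tag_name):
    pairs = TagGifPair.query(TagGifPair.tag == ndb.Key(Tag, tag_name)).fetch()
    return ndb.get_multi([pair.gif for pair in pairs])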