Django filter exact m2m objects - django

Let's say I have a team model, and a team has members.
So
class Team(models.Model):
    team_member = models.ManyToManyField('Employee')
class Employee(models.Model):
    ....
Let's say I have a list of employee ids like team_members = [1001, 1003, 1004] and I want to find the Team that is made up of exactly those three members.
I don't want the team that has [1001, 1003, 1004, 1005] or the team that has [1001, 1003].
Only team [1001, 1003, 1004].
This is what I'm doing now:
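team = None  # avoids a NameError below when no team matches
teams = Team.objects.all()
for t in teams:
    if set([x.id for x in t.team_member.all()]) == set(team_members):
        team = t
if not team:
    team = Team.objects.create()
    team.team_member.set(team_members)  # direct assignment to an m2m field was removed in Django 2.0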
But it seems a bit ham-handed. Is there a cleaner way, with fewer nested loops?

The short answer
No, I don't know of a much simpler way in terms of code appearance.
However there are some things you could do to make your code a little more graceful and potentially a lot faster. Plus it is possible to do the work in the database, albeit quite inefficiently for large team sizes.
The DB option listed below is pretty much as ham-handed as the for loop you provided, but could be more efficient depending on your data set, DB, etc.
Longer answer: ways to be less 'ham-handed'
There are a couple of places I'd clean up the style here.
Plus, in my experience with Django, loops like the one you built do tend to become pretty expensive on large data sets. If you end up loading, say, 10,000 teams into memory, having the ORM convert them to Team objects, and then iterating over them, you'll probably see some significant slowdown.
Three things to try for speed & grace:
Use Team.objects.values_list('id', 'team_member') for your in-Python filter loop, which skips the step where Django organizes all of the SQL data into Model objects. I've found this to save lots of time instantiating objects (sometimes around an order of magnitude).
Straighten out your set() calls. Currently you're re-converting team_members to a set() on every iteration, plus you're implicitly turning t.team_member into Employee objects (as they're fetched from the DB), then into a list of ids, and then into a set. For the first item, build team_members_set = set(team_members) once up front and reuse it. For the second item, you can do set(t.team_member.values_list('id', flat=True)), which skips the heaviest ORM step of instantiating Employee objects (which could be as bad as O(n^2) in your example, depending on the data set and Django's caching).
Use Team.objects.all().iterator() to avoid loading all the Teams into memory at once. This will help if you're running into memory issues.
But with any performance optimization, of course test your perf with real or real-ish data to be sure you're not making things worse!
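The in-Python comparison can be sketched without any model instances. This is a minimal illustration, assuming rows shaped like (team id, employee id), such as what Team.objects.values_list('id', 'team_member') would return; the function name find_exact_team and the sample rows are hypothetical.

```python
from collections import defaultdict


def find_exact_team(pairs, wanted_ids):
    """Return the id of a team whose member set equals wanted_ids, else None."""
    wanted = set(wanted_ids)  # built once, reused for every team
    members = defaultdict(set)
    for team_id, employee_id in pairs:
        if employee_id is not None:  # a team without members yields None
            members[team_id].add(employee_id)
    for team_id, ids in members.items():
        if ids == wanted:
            return team_id
    return None


# Hypothetical rows standing in for the values_list() output.
pairs = [
    (1, 1001), (1, 1003),
    (2, 1001), (2, 1003), (2, 1004),
    (3, 1001), (3, 1003), (3, 1004), (3, 1005),
]
print(find_exact_team(pairs, [1001, 1003, 1004]))
```

Only team 2 has exactly the wanted members; the subset (team 1) and the superset (team 3) are skipped because whole sets are compared.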
Longer answer: the DB option
After trying all manner of Q() manipulation and other approaches listed in the answers here, to no avail, I found this answer by @Todor.
Basically you need to do repeated filter()s, one for each team_member. On top of that you use a Count filter to make sure that you don't end up choosing a Team with a superset of the desired members.
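from functools import reduce  # reduce() is not a builtin in Python 3
from django.db import models
desired_members = [1001, 1003, 1004]
# Keep only teams whose total member count equals the number of desired members.
initial_queryset = Team.objects.annotate(cnt=models.Count('team_member')).filter(cnt=len(desired_members))
matching_teams = reduce(  # Can of course use a for loop if you prefer that to reduce()
    lambda queryset, member: queryset.filter(team_member=member),
    desired_members,
    initial_queryset
)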
Note that the resulting query will likely have perf issues for large teams, since it will do one JOIN for every one of your desired_members. It'd be nice to avoid that but I don't know of another way to do this all in the database without changing your data structure. I'd love to learn a better way, and if you end up doing some perf testing I'd be curious to find what you learn!

Maybe you can use annotate for the count of team_member. Can you try this?
Team.objects.filter(team_member__pk__in=team_members).annotate(num_team=Count('team_member')).filter(num_team=len(team_members))
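One caveat when filtering before counting: the count then covers only the rows that survived the filter, so a team with extra members can still reach the required count. The sketch below shows the effect in plain SQL through sqlite3. The table and column names are hypothetical, and the SQL is hand-written for illustration rather than what the ORM emits. The second query counts all members and matching members in the same GROUP BY, which is one possible way to exclude supersets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE team_member (team_id INTEGER, employee_id INTEGER)")
rows = [
    (1, 1001), (1, 1003), (1, 1004),             # exact match
    (2, 1001), (2, 1003), (2, 1004), (2, 1005),  # superset
    (3, 1001), (3, 1003),                        # subset
]
conn.executemany("INSERT INTO team_member VALUES (?, ?)", rows)

wanted = [1001, 1003, 1004]
marks = ",".join("?" for _ in wanted)

# Filter first, then count: only matching rows are counted.
filter_then_count = [r[0] for r in conn.execute(
    f"SELECT team_id FROM team_member WHERE employee_id IN ({marks}) "
    f"GROUP BY team_id HAVING COUNT(*) = ? ORDER BY team_id",
    wanted + [len(wanted)],
)]

# Count all members and matching members in the same GROUP BY.
exact = [r[0] for r in conn.execute(
    f"SELECT team_id FROM team_member GROUP BY team_id "
    f"HAVING COUNT(*) = ? AND SUM(employee_id IN ({marks})) = ? "
    f"ORDER BY team_id",
    [len(wanted)] + wanted + [len(wanted)],
)]

print(filter_then_count, exact)
```

With this sample data the first query also returns the superset team, while the second returns only the team whose membership is exactly the wanted list.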

To get the Team with those exact three members you can use:
Team.objects.get(team_member__pk=team_members) # This code was untested
You could also try with a list of Employee objects:
# team_members = Employee.objects.filter(pk__in=team_members)
team_members = [<Employee: Employee object>, <Employee: Employee object>, <Employee: Employee object>]
Team.objects.get(team_member=team_members)

Related

AWS Machine Learning Data

I'm using the AWS Machine Learning regression to predict the waiting time in a line of a restaurant, in a specific weekday/time.
Today I have around 800k rows of data.
Example Data:
restaurantID (rowID)   weekDay (categorical)   time (categorical)   tablePeople (numeric)   waitingTime (numeric - target)
1                      sun                     21:29                2                       23
2                      fri                     20:13                4                       43
...
I have two questions:
1)
Should I use time as Categorical or Numeric?
Is it better to split it into two fields: hours and minutes?
2)
I would like in the same model to get the predictions for all my restaurants.
Example:
I expected to send the rowID identifier and get back different predictions based on each restaurant's data (ignoring the others' data).
I tried, but it's returning the same prediction for any rowID. Why?
Should I have a model for each restaurant?
There are several problems with the way you set up your model.
1) Time in the form you have it should never be categorical. Your model treats the times 12:29 and 12:30 as two completely independent attributes, so it will never use facts it learns about 12:29 to predict what's going to happen at 12:30. In your case you should set time to be numeric. I'm not sure whether Amazon ML can convert it for you automatically; if not, just multiply the hour by 60 and add the minutes. Another interesting thing to do is to bucketize your time into half-hour or wider intervals. You do this by dividing (h*60+m) by a number that depends on how many buckets you want and keeping the integer part; for example, divide by 120 to get 2-hour intervals. Generally, the more data you have, the smaller the intervals can be. The key is to have a lot of samples in each bucket.
2) You should really think about removing restaurantID from your input data. Having it there will cause the model to over-fit on it, so it will not be able to make predictions about the restaurant with id:5 based on the facts it learned from the restaurants with id:3 or id:9. Having the restaurant id there might be okay if you have a lot of data about each restaurant and you don't care about extrapolating your predictions to restaurants that are not in the training set.
3) You never send restaurantID to predict data about it. The way it usually works is that you pick what you are trying to predict. In your case 'waitingTime' is probably the most useful attribute, so you need to send weekDay, time and number of people, and the model will output the waiting time.
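The numeric conversion and bucketing described in point 1 can be sketched as follows. This is a minimal illustration, assuming times arrive as "HH:MM" strings like those in the example data; the function names are hypothetical.

```python
def minutes_since_midnight(hhmm):
    """Convert an "HH:MM" string to a single number: hour * 60 + minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def time_bucket(hhmm, bucket_minutes=120):
    """Index of the interval a time falls into (2-hour intervals by default)."""
    return minutes_since_midnight(hhmm) // bucket_minutes


print(minutes_since_midnight("21:29"))  # 1289
print(time_bucket("21:29"))             # 10, i.e. the 20:00-22:00 interval
print(time_bucket("20:13", 30))         # 40, i.e. the 20:00-20:30 interval
```

Nearby times such as 12:29 and 12:30 now map to adjacent numbers or to the same bucket, so what the model learns about one carries over to the other.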
You should think what is relevant for the prediction to be accurate, and you should use your domain expertise to define the features/attributes you need to have in your data.
For example, time of day is not just a number. From my limited understanding of restaurants, I would drop the minutes and focus only on the hours.
I would certainly create a model for each restaurant, as the popularity of the restaurant or the type of food it serves has an impact on the wait time. With Amazon ML it is easy to create many models, since you can build a model using the SDK and even schedule retraining of the models using AWS Lambda (that is, automatically).
I'm not sure what the feature called tablePeople means, but a general recommendation is to have as many relevant features as possible to get better predictions. For example, month or season is probably important as well.
In contrast with some answers to this post, I think restaurantID helps and actually gives valuable information. If you have a significant amount of data for each restaurant, you can train a model per restaurant and get good accuracy, but if you don't have enough data, restaurantID is very informative.
1) Imagine you had only two columns in your dataset: restaurantID and waitingTime. Wouldn't the restaurantID from the testing data help you find a rough waiting time? In the simplest implementation, your predicted waiting time for each restaurantID would be the average of its waitingTime values. So restaurantID is definitely valuable information. Now that you have more features in your dataset, you need to check whether restaurantID is as effective as the other features.
2) If you decide to keep restaurantID, you must use it as a categorical string. It should be a non-parametric feature in your dataset, and not treating it that way may be why you did not get a proper result.
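The "simplest implementation" from point 1 can be sketched as a per-restaurant average. This is only a baseline to compare a trained model against, not how Amazon ML works internally; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def mean_wait_by_restaurant(rows):
    """Average waitingTime per restaurantID from (restaurantID, waitingTime) pairs."""
    totals = defaultdict(lambda: [0, 0])  # restaurantID -> [sum, count]
    for restaurant_id, waiting_time in rows:
        totals[restaurant_id][0] += waiting_time
        totals[restaurant_id][1] += 1
    return {rid: total / count for rid, (total, count) in totals.items()}


# Made-up sample rows for illustration only.
rows = [(1, 23), (1, 27), (2, 43), (2, 41), (2, 45)]
baseline = mean_wait_by_restaurant(rows)
print(baseline)  # {1: 25.0, 2: 43.0}
```

If a model that includes the other features cannot beat this baseline, restaurantID is carrying most of the signal.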
On the issue with day and time I agree with other answers and considering that you are building your model for the restaurant, hourly time may give a more accurate result.

Modelling EVERY day in Django

I have a booking system for something where the price can change based on the day. The admins for the site can make these changes. If a booking crosses the boundary of a daily rate, they pay pro-rata for the rates they used.
I'm losing confidence in how this is implemented. There are at least two ways:
Having Rates that specify their validity (start, end fields) and then working out which of those apply. But which overlapping ones take priority? Etc. Nasty. This is what we're trying to do and cannot currently answer sufficiently well.
The same except that there is some form of unique quality to date so that no two rates can overlap. The problem here is we'd need to split existing Rates on insert and rejoin two on delete/edit, etc if they had the same value. We'd need to make sure there were no gaps. It requires some heavy ORM overriding.
Keeping a DayRate table with every day defined. This means keeping a load of extra data around but most bookings are for tens of days, not thousands so I'm not worried about the database bandwidth requirements here. Date would be primary-unique and I'd just do a range filter for grabbing which ones I need to factor in.
The problem is generating these dates ahead of time. I know that as soon as I implement this, somebody will make a booking for 2032. Is there a good way around this or should we limit them?
None of these answers seems great and I have to imagine that I'm not the first guy with a booking system. Is there a better way of keeping track of a rate over a contiguous (possibly infinite) amount of time?
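One variation on the non-overlapping option, offered here as an assumption rather than something from the post: store only the date each rate starts, so a rate implicitly lasts until the next one begins. Rates then cannot overlap or leave gaps, and the latest rate extends indefinitely, which covers far-future bookings without pre-generating days. A minimal sketch in plain Python with hypothetical rates and names:

```python
from bisect import bisect_right
from datetime import date, timedelta

# Hypothetical rate changes, sorted by start date. Each rate applies from its
# start date until the day before the next entry; the last one never expires.
RATE_CHANGES = [
    (date(2024, 1, 1), 100),
    (date(2024, 6, 1), 120),
    (date(2024, 9, 1), 90),
]
STARTS = [start for start, _ in RATE_CHANGES]


def rate_on(day):
    """Rate in force on a given day: the latest change at or before that day."""
    index = bisect_right(STARTS, day) - 1
    if index < 0:
        raise ValueError(f"no rate defined on {day}")
    return RATE_CHANGES[index][1]


def booking_cost(first_day, last_day):
    """Pro-rata cost: sum of the daily rate for every day in the booking."""
    days = (last_day - first_day).days + 1
    return sum(rate_on(first_day + timedelta(days=i)) for i in range(days))


print(booking_cost(date(2024, 5, 30), date(2024, 6, 2)))  # 2 * 100 + 2 * 120 = 440
print(rate_on(date(2032, 1, 1)))                          # 90
```

In a database, the same lookup would be "the row with the greatest start date not after the given day", with a unique constraint on the start date.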

Best practices for managing workarounds (for broken data)

I have to work with government-provided data that is sometimes broken in strange ways. My code already contains snippets like:
for row in governmental_data:
    # XXX Workaround for that one row among thousands
    # that was mislabeled by a clerk and will not be fixed
    # before form A-320-Tango-5 is completed and submitted
    # on the first Sunday after a solstice.
    if row is the_spawn_of_satan:
        row = fix_row_A320(row)
    # XXX end of workaround
    process_row(row)
which before the error was just
for row in governmental_data:
    process_row(row)
I cannot make a mirror of the data with the fixes applied, because the data is dynamic.
What can I do to manage these workarounds as they grow in number? Are there any best practices (besides "do not provide broken data to begin with")?
I suggest using the Decorator design pattern to handle this data conversion issue. The Wikipedia page has a coffee-making example. In the same way, I suggest that every data conversion be a decorator that takes a row, performs some operations on it, and gives back a row. This design pattern is well established. The Intercepting Filter design pattern is a similar idea, implemented both in Java (servlet filters) and .NET (ASP.NET MVC filters).
Your code should look like the following:
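# Each entry is a function that takes a row and returns a (possibly fixed) row.
listOfDataConversionFilters = [XXXWorkaround, formA_320Tango5, ...]
for row in governmental_data:
    for conversionFilter in listOfDataConversionFilters:
        row = conversionFilter(row)  # feed each filter the previous one's output
    process_row(row)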
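A runnable sketch of the filter-list idea, with each workaround written as a small function that takes a row and returns a row. The row shape (a dict), the field names, and both workaround functions are hypothetical; the point is that each fix lives in one named place and the processing loop stays unchanged as fixes are added.

```python
def strip_whitespace(row):
    """Hypothetical workaround: trim stray spaces from every string field."""
    return {key: value.strip() if isinstance(value, str) else value
            for key, value in row.items()}


def fix_mislabeled_region(row):
    """Hypothetical workaround: repair one known-bad label, leave other rows alone."""
    if row.get("region") == "Nrth":
        return {**row, "region": "North"}
    return row


# Order matters: later workarounds see the output of earlier ones.
WORKAROUNDS = [strip_whitespace, fix_mislabeled_region]


def apply_workarounds(row, workarounds=WORKAROUNDS):
    for workaround in workarounds:
        row = workaround(row)
    return row


rows = [{"region": " Nrth ", "count": 3}, {"region": "South", "count": 5}]
cleaned = [apply_workarounds(row) for row in rows]
print(cleaned)
```

Retiring a workaround once the upstream data is fixed then means deleting one function and one list entry.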

GAE ndb design, performance and use of repeated properties

Say I have a picture gallery and a picture could potentially have 100k+ fans. Which ndb design is more efficient?
class picture(ndb.Model):
    fanIds = ndb.StringProperty(repeated=True)
    # ... [other picture properties]
or
class picture(ndb.Model):
    # ... [other picture properties]
class fan(ndb.Model):
    pictureId = ndb.StringProperty()
    fanId = ndb.StringProperty()
Is there any limit on the number of items you can add to an ndb repeated property and is there any performance hit with storing a large amount of items in a repeated property? If it is less efficient to use repeated properties, what is their intended use?
Do not use repeated properties if you have more than 100-1000 values. (1000 is probably already pushing it.) They weren't designed for such use.
Generally v1 would be much cheaper.
In terms of read/write costs, you pay per entity fetched or written, so you want to reduce the number of entities. Version 1 will be cheaper, and significantly cheaper if you fetch every fan every time you fetch a picture.
However each entity is limited to 1MB. If you have 100k+ fans, you could hit that limit depending on the size of your fanId. That's not counting your other picture data, so you could blow that 1MB limit. You'll have to add some more complex code to handle overflow cases.
Large entities take longer to fetch than small entities. If you're going to fetch all the fans at once all the time, v1 will be better. If you're only going to fetch say 5 fans at any one point, v2 might be faster (only might). If on the other hand you try to pull 100k fan entities... that's gonna take forever.
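A rough back-of-the-envelope check of the size concern can be sketched as below. It assumes the 1MB limit means 1024 * 1024 bytes and counts only the raw id bytes, ignoring any per-value storage overhead, so real entities would be larger. The id lengths are hypothetical.

```python
ENTITY_LIMIT_BYTES = 1024 * 1024  # assumption: the 1MB per-entity limit


def repeated_property_bytes(fan_count, id_length_bytes):
    """Lower bound on size: raw id bytes only, no per-value overhead."""
    return fan_count * id_length_bytes


for id_length in (8, 16, 32):
    size = repeated_property_bytes(100_000, id_length)
    print(id_length, size, size > ENTITY_LIMIT_BYTES)
```

Under these assumptions, 100k ids of 8 bytes each come to 800,000 bytes and still fit, while 16-byte ids already come to 1,600,000 bytes and exceed the limit before any other picture data is counted.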

How to optimize use of querysets with lists

I have a model that has a couple million objects. Each object represents a call made/received by a company.
To simplify things, let's say this model, Call, has these fields:
calldate, context, channel.
My goal is to know the average # of calls made and received during each hour of the day of the month (load by hour). The catch is: I need to find this for port1 and port2 separately.
As of now, my code works fine, except that it takes around 1 whole minute to give me the result for a range of 4 months, and it seems extremely inefficient.
I've done some simple profiling and discovered that the extend is taking around 99% of the processing time:
queryset = Call.objects.filter(calldate__gte='SOME_DATE')
port1, port2 = [], []
port1.extend(queryset.filter(context__icontains="e1-1"))
port2.extend(queryset.filter(context__icontains="e1-2"))
channels_in_port1 = ["Port/%d-2" % x for x in range(1, 32)]
channels_in_port2 = ["Port/%d-2" % x for x in range(32, 63)]
for i in channels_in_port1:
    port1.extend(queryset.filter(channel__icontains=i))
for i in channels_in_port2:
    port2.extend(queryset.filter(channel__icontains=i))
port1 and port2 have around 150k objects combined now.
As soon as I have all calls for port1 and port2, I'm good to go. The rest of the code is basically some for loops for port1 and port2 that sums up and takes the average of calls according to the hour/day/month. Trivial stuff.
I tried to avoid using any "extend" by using itertools.chain and chaining the querysets instead. However, that made the processing time shift to the part where I do the trivial for loops to calculate the load by hour.
Any alternatives? Better ways to filter the queryset?
Thanks very much!!
Have you considered using django's aggregate functions? http://docs.djangoproject.com/en/dev/topics/db/aggregation/
I presume your problem is with the second set of extends, ie those within the for loops, rather than the first. (The first is completely unnecessary, in any case: rather than defining an empty list up front and extending it, you can just do port1 = list(queryset.filter(context__icontains="e1-1")).)
Anyway, to summarize what I think you are trying to do: you want to get all Call objects for a certain date, in two blocks depending on the value for channel: one where it contains values from 1 to 31, and one with values between 32 and 62.
It seems like you could do this with just two queries, without any extending at all:
port1 = queryset.filter(channel__range=["Port/1-2", "Port/31-2"])
port2 = queryset.filter(channel__range=["Port/32-2", "Port/62-2"])
Does that not do what you want?
Edit in response to comment: but that's then just two queries, which you can extend or concatenate. The problem with your code as posted is that you are doing 31 queries and extend operations for each port, which is bound to be expensive. If you just do one each, plus one extend/concat, that will be much cheaper.
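Because channel is a string, a range comparison on it is lexicographic in Python, and a database would typically compare text character by character as well, depending on collation, so it may not line up with numeric port order. An alternative sketch, assuming channel values contain the "Port/<n>-2" pattern from the question: fetch the rows once, parse the number per call, and tally by hour in a single pass. The function names and sample calls are hypothetical, and unlike the original extend() loops, a call matching both context and channel is counted once.

```python
import re
from collections import Counter
from datetime import datetime

CHANNEL_RE = re.compile(r"Port/(\d+)-2", re.IGNORECASE)


def port_for(context, channel):
    """Return 1 or 2 for the port a call belongs to, or None if neither."""
    lowered = context.lower()
    if "e1-1" in lowered:
        return 1
    if "e1-2" in lowered:
        return 2
    match = CHANNEL_RE.search(channel)
    if match:
        number = int(match.group(1))
        if 1 <= number <= 31:
            return 1
        if 32 <= number <= 62:
            return 2
    return None


def calls_per_hour(calls):
    """Tally calls per hour of day for each port in one pass over the rows."""
    counts = {1: Counter(), 2: Counter()}
    for calldate, context, channel in calls:
        port = port_for(context, channel)
        if port is not None:
            counts[port][calldate.hour] += 1
    return counts


# Lexicographic comparison puts "Port/4-2" outside this string range.
print("Port/1-2" <= "Port/4-2" <= "Port/31-2")  # False

calls = [
    (datetime(2024, 1, 1, 9, 5), "from-e1-1", "SIP/100"),
    (datetime(2024, 1, 1, 9, 40), "internal", "Port/4-2"),
    (datetime(2024, 1, 1, 10, 0), "internal", "Port/40-2"),
    (datetime(2024, 1, 1, 10, 30), "internal", "SIP/7"),
]
counts = calls_per_hour(calls)
print(counts)
```

The tuples could come from a single values_list('calldate', 'context', 'channel') query over the date-filtered queryset, which avoids both the repeated queries and model instantiation.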