I'm working on a side project using Python and Django. It's a website that tracks the prices of products from another website and then shows the full price history of each product.
So, I have this class in Django:
class Product(models.Model):
    price = models.FloatField()
    date = models.DateTimeField(auto_now=True)
    name = models.CharField()
Then, in my views.py, I want to display the products in a table, like so:
+----------+--------+--------+--------+--------+....
| Name | Date 1 | Date 2 | Date 3 |... |....
+----------+--------+--------+--------+--------+....
| Product1 | 100.0 | 120.0 | 70.0 | ... |....
+----------+--------+--------+--------+--------+....
...
I'm using the following class for rendering:
class ProductView(object):
    name = ""
    price_history = {}
That way, in my template, I can easily turn each ProductView object into one table row. I'm also passing a sorted list of all available dates through the context, both for constructing the table head and for looking up each product's price on each of those dates.
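For context, here is a minimal sketch of how such a view might pass both pieces through the context; the view and template names are illustrative, not taken from the original project:

from django.shortcuts import render

def product_table(request):
    product_views = conversion()  # conversion() is shown below
    all_dates = sorted({p.date for p in Product.objects.all()})
    return render(request, 'products.html',
                  {'dates': all_dates, 'product_views': product_views})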
Then I have logic in views that converts one or more products into this ProductView object. The logic looks something like this:
def conversion():
    result_dict = {}
    all_products = Product.objects.all()
    for product in all_products:
        if product.name in result_dict:
            result_dict[product.name].append(product)
        else:
            result_dict[product.name] = [product]
    # So result_dict will be like
    # {"Product1": [product, product], "Product2": [product], ...}
    product_views = []
    for products in result_dict.values():
        # Logic that converts a list of Product into a ProductView, which is simple.
        ...
    # Then I'm returning product_views, sorted based on the price on the
    # latest date, None if not available.
    return sorted(product_views,
                  key=lambda x: get_latest_price(latest_date, x),
                  reverse=True)
As per Daniel Roseman and zymud, adding get_latest_price:
def get_latest_price(date, product_view):
    if date in product_view.price_history:
        return product_view.price_history[date]
    else:
        return None
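For completeness, here is a hedged guess at the omitted list-of-Product to ProductView step; build_product_view is a hypothetical helper, since the original only notes that the logic is simple:

def build_product_view(products):
    # products: all Product rows sharing the same name
    view = ProductView()
    view.name = products[0].name
    view.price_history = {p.date: p.price for p in products}
    return view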
I omitted the logic for getting the latest date in conversion. I have a separate table that records each date on which I run the price-collecting script that adds new data. So getting the latest date is essentially fetching the row of the OpenDate table with the highest ID.
So, the question is: when the number of products grows huge, how do I paginate that product_views list? E.g. if I only want to see 10 products at a time in my web application, how do I tell Django to fetch only those rows from the DB?
I can't (or don't know how to) use django.core.paginator.Paginator, because to build the 10 rows I want, Django needs to select all rows belonging to those 10 product names. But to figure out which 10 names to select, it first needs to fetch all objects and then work out which ones have the highest price on the latest date.
It seems to me the only solution would be to add something between Django and the DB, like a cache, to store the ProductView objects. But other than that, is there a way to paginate the product_views list directly?
I'm wondering if this makes sense:
The basic idea is: since I need to sort all product_views by the price on the "latest" date, I'll do that part in the DB first and fetch only the list of product names, which makes it "paginatable". Then I'll run a second DB query to get all the products with those names, and construct that many product_views. Does that make sense?
To make it a bit more concrete, here is the code:
So instead of
#def conversion():
all_products = Product.objects.all()
I'm doing this:
#def conversion():
# This would get me the latest available date
latest_date = OpenDate.objects.order_by('-id')[:1]
top_ten_priced_product_names = (Product.objects
                                .filter(date__in=latest_date)
                                .order_by('-price')
                                .values_list('name', flat=True)[:10])
all_products_that_i_need = Product.objects.filter(name__in=top_ten_priced_product_names)
# then I can construct that list of product_views using
# all_products_that_i_need
Then for pages after the first, I can change that [:10] to [10:20], [20:30], and so on.
This makes pagination easier, and by pulling the appropriate code into a separate function, it also becomes possible to do Ajax and all that fancy stuff.
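For example, a minimal sketch of such a function, assuming zero-based page numbers and a hypothetical build_product_views() helper for the omitted Product-to-ProductView conversion:

def top_products_page(page, page_size=10):
    latest_date = OpenDate.objects.order_by('-id')[:1]
    start, stop = page * page_size, (page + 1) * page_size
    names = list(Product.objects
                 .filter(date__in=latest_date)
                 .order_by('-price')
                 .values_list('name', flat=True)[start:stop])  # query for this page's names
    products = Product.objects.filter(name__in=names)          # query for their full history
    return build_product_views(products)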
But here comes a problem: this solution needs three DB calls for every single page view. Right now I'm running everything on the same box, but I'd still like to reduce that overhead to two (one for OpenDate, the other for Product).
Is there a better solution that solves the pagination problem with only two DB calls?
Related
I have this model:
class User_Data(AbstractUser):
    date_of_birth = models.DateField(null=True, blank=True)
    city = models.CharField(max_length=255, default='', null=True, blank=True)
    address = models.TextField(default='', null=True, blank=True)
    gender = models.TextField(default='', null=True, blank=True)
And I need to run a Django query to get the count of users for each age. Something like this:
Age || Count
10 || 100
11 || 50
and so on.....
Here is what I did with lambda:
from operator import itemgetter

usersAge = list(map(lambda x: calculate_age(x[0]), User_Data.objects.values_list('date_of_birth')))
users_age_data_source = [[x, usersAge.count(x)] for x in set(usersAge)]
users_age_data_source = sorted(users_age_data_source, key=itemgetter(0))
There are a few ways of doing this. I've had to do something very similar recently. This example works in Postgres.
Note: I've written the following code the way I have so that it works syntactically and so that I can comment between each step. But you can chain these together if you like.
First we need to annotate the queryset to obtain the 'age' parameter. Since it's not stored as an integer, and can change daily, we can calculate it from the date of birth field by using the database's 'current_date' function:
from django.db.models.expressions import RawSQL

ud = User_Data.objects.annotate(
    age=RawSQL("""(DATE_PART('year', current_date) - DATE_PART('year', "app_userdata"."date_of_birth"))::integer""", []),
)
Note: you'll need to change the "app_userdata" part to match the table of your model. You can pick this out of the model's _meta, but that just depends on whether you want to make this portable or not. If you do, use a string .format() to substitute in what the model's _meta provides. If you don't care about that, just put the table name in there.
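For instance, a small sketch of that substitution; only the table name changes, the SQL is the same as above:

table = User_Data._meta.db_table  # e.g. "app_userdata"
age_sql = (
    "(DATE_PART('year', current_date) - "
    "DATE_PART('year', \"{t}\".\"date_of_birth\"))::integer"
).format(t=table)

ud = User_Data.objects.annotate(age=RawSQL(age_sql, []))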
Now we pick the 'age' value out so that we get a ValuesQuerySet with just this field
ud = ud.values('age')
And then annotate THAT queryset with a count of age
from django.db.models import Count

ud = ud.annotate(
    count=Count('age'),
)
At this point we have a ValuesQuerySet that has both 'age' and 'count' as fields. Order it so it comes out in a sensible way:
ud = ud.order_by('age')
And there you have it.
You must build up the queryset in this order, otherwise you'll get some interesting results. That is, you can't group all the annotates together, because the second one (the count) depends on the first, and since a kwargs dict has no notion of the order in which its arguments were defined, the queryset's field/dependency checking will fail.
Hope this helps.
If you aren't using Postgres, the only thing you'll need to change is the RawSQL annotation to match whatever database engine you're using. However your engine extracts the year of a date, whether from a field or from its built-in "current date" function, as long as you can get it out as an integer this will work exactly the same way.
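If you'd rather avoid RawSQL entirely, here is a hedged, database-agnostic sketch, assuming Django 1.10+ for ExtractYear; it uses the same current-year-minus-birth-year approximation of age, with the current year taken from Python rather than from the database:

from datetime import date

from django.db.models import Count, ExpressionWrapper, IntegerField, Value
from django.db.models.functions import ExtractYear

age_expr = ExpressionWrapper(
    Value(date.today().year) - ExtractYear('date_of_birth'),
    output_field=IntegerField(),
)

ud = (User_Data.objects
      .annotate(age=age_expr)
      .values('age')
      .annotate(count=Count('age'))
      .order_by('age'))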
This question is remotely related to Django ORM: filter primary model based on chronological fields from related model; it further limits the resulting queryset.
The models
Assuming we have the following models:
class Patient(models.Model):
    name = models.CharField()
    # other fields following

class MedicalFile(models.Model):
    patient = models.ForeignKey(Patient, related_name='files')
    issuing_date = models.DateField()
    expiring_date = models.DateField()
    diagnostic = models.CharField()
The query
I need to select all the files which are valid on a specified date, most likely in the past. The problem I have is that for every patient there will be a small overlapping period during which that patient has 2 valid files. If we're querying for a date inside that small timeframe, I need to select only the most recent file.
More to the point, consider patient John Doe. He will have a string of "uninterrupted" files starting in 2012, like this:
+---+------------+-------------+
|ID |issuing_date|expiring_date|
+---+------------+-------------+
|1 |2012-03-06 |2013-03-06 |
+---+------------+-------------+
|2 |2013-03-04 |2014-03-04 |
+---+------------+-------------+
|3 |2014-03-04 |2015-03-04 |
+---+------------+-------------+
As one can easily observe, there is an overlap of a couple of days in the validity of these files. For instance, on 2013-03-05 both files 1 and 2 are valid, but we consider only file 2 (the most recent one). I'm guessing this use case isn't special: it's the same as managing subscriptions, where in order to keep a continuous subscription you renew it a little early.
Now, in my application I need to query historical data, e.g. give me all the files which were valid on 2013-03-05, considering only the "most recent" ones. I was able to solve this with RawSQL, but I would like a solution without raw SQL. In the previous question, we were able to filter the "latest" file by aggregating over the reverse relation, something like:
qs = MedicalFile.objects.annotate(latest_file_date=Max('patient__files__issuing_date'))
qs = qs.filter(issuing_date=F('latest_file_date')).select_related('patient')
The problem is that we need to limit the range over which latest_file_date is computed, by filtering against 2013-03-05. But aggregate functions don't run over filtered querysets...
The "poor" solution
I'm currently doing this via an extra annotation on the queryset (substitute "app" with your concrete application):
reference_date = datetime.date(year=2013, month=3, day=5)
annotation_latest_issuing_date = {
    'latest_issuing_date': RawSQL('SELECT max(file.issuing_date) '
                                  'FROM <app>_medicalfile file '
                                  'WHERE file.patient_id = <app>_medicalfile.patient_id '
                                  '  AND file.issuing_date <= %s', (reference_date,))
}
qs = MedicalFile.objects.filter(expiring_date__gt=reference_date, issuing_date__lte=reference_date)
qs = qs.annotate(**annotation_latest_issuing_date).filter(issuing_date=F('latest_issuing_date'))
Written as such, the queryset returns the correct number of records.
Question: how can this be achieved without RawSQL and (as already implied) at the same performance level?
You can use id__in and provide your nested filtered queryset (like all files that are valid at the given date).
qs = (MedicalFile.objects
      .filter(id__in=MedicalFile.objects.filter(expiring_date__gt=reference_date,
                                                issuing_date__lte=reference_date))
      .order_by('patient__pk', '-issuing_date')
      .distinct('patient__pk'))  # field_name parameter only supported by Postgres
The order_by groups the files by patient, with the latest issuing date first. distinct then retrieves that first file for each patient. However, general care is required when combining order_by and distinct: https://docs.djangoproject.com/en/1.9/ref/models/querysets/#django.db.models.query.QuerySet.distinct
Edit: Removed single patient dependence from first filter and changed latest to combination of order_by and distinct
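If you're not on Postgres, where distinct with field names isn't available, a possible alternative (my addition, not part of the answer above) is a Subquery/OuterRef sketch, assuming Django 1.11+: pick each patient's most recent file issued on or before the reference date and keep only those rows.

from django.db.models import OuterRef, Subquery

latest_per_patient = (MedicalFile.objects
                      .filter(patient=OuterRef('patient'),
                              issuing_date__lte=reference_date)
                      .order_by('-issuing_date')
                      .values('pk')[:1])

qs = (MedicalFile.objects
      .filter(expiring_date__gt=reference_date,
              issuing_date__lte=reference_date)
      .filter(pk=Subquery(latest_per_patient)))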
Consider that p is a Patient class instance.
I think you can do something like:
p.files.filter(issuing_date__lt='some_date', expiring_date__gt='some_date')
See https://docs.djangoproject.com/en/1.9/topics/db/queries/#backwards-related-objects
Or maybe with the Q magic query object...
I have two models:
class Base_Activity(models.Model):
    ...  # some fields

class User_Activity(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL)
    activity = models.ForeignKey(Base_Activity)
    rating = models.IntegerField(default=0)  # will be -1, 0, or 1
Now I want to query Base_Activity and sort it so that the items with the most corresponding user activities with rating=1 come out on top. I want to do something like the query below, but the =1 part is obviously not working.
activities = Base_Activity.objects.all().annotate(
    up_votes=Count('user_activity__rating'=1),
).order_by(
    'up_votes'
)
How can I solve this?
You cannot use Count like that, as the error message says:
SyntaxError: keyword can't be an expression
The argument of Count must be a simple string, like user_activity__rating.
I think a good alternative can be to use Avg and Count together:
activities = Base_Activity.objects.all().annotate(
    a=Avg('user_activity__rating'), c=Count('user_activity__rating')
).order_by(
    '-a', '-c'
)
The items with the most rating=1 activities should have the highest average, and among items with the same average, the ones with the most activities will be listed higher.
If you want to exclude items that have downvotes, make sure to add the appropriate filter or exclude operations after annotate, for example:
activities = Base_Activity.objects.all().annotate(
    a=Avg('user_activity__rating'), c=Count('user_activity__rating')
).filter(user_activity__rating__gt=0).order_by(
    '-a', '-c'
)
UPDATE
To get all the items, ordered by their upvotes, disregarding downvotes, I think the only way is to use raw queries, like this:
from django.db import connection
sql = '''
SELECT o.id, SUM(v.rating > 0) s
FROM user_activity o
JOIN rating v ON o.id = v.user_activity_id
GROUP BY o.id ORDER BY s DESC
'''
cursor = connection.cursor()
cursor.execute(sql)
rows = cursor.fetchall()
Note: instead of hard-coding the table names of your models, get the table names from the models, for example if your model is called Rating, then you can get its table name with Rating._meta.db_table.
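For example, a sketch of that interpolation using the models from this question; the activity_id column name follows from the ForeignKey being called activity, so adjust it to your actual schema:

sql = '''
SELECT o.id, SUM(v.rating > 0) s
FROM {activity} o
JOIN {user_activity} v ON o.id = v.activity_id
GROUP BY o.id ORDER BY s DESC
'''.format(
    activity=Base_Activity._meta.db_table,
    user_activity=User_Activity._meta.db_table,
)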
I tested this query on an SQLite3 database; I'm not sure the SUM expression works in every DBMS. By the way, I had a perfect Django site to test on, where I also use upvotes and downvotes. I use a very similar model for counting them, but I order by the sum value, Stack Overflow style. The site is open-source, if you're interested.
Let's assume I want to show a list of runners ordered by their latest sprint time.
class Runner(models.Model):
    name = models.CharField(max_length=255)

class Sprint(models.Model):
    runner = models.ForeignKey(Runner)
    time = models.PositiveIntegerField()
    created = models.DateTimeField(auto_now_add=True)
This is a quick sketch of what I would do in SQL:
SELECT runner.id, runner.name, sprint.time
FROM runner
LEFT JOIN sprint ON (sprint.runner_id = runner.id)
WHERE
    sprint.id = (
        SELECT sprint_inner.id
        FROM sprint AS sprint_inner
        WHERE sprint_inner.runner_id = runner.id
        ORDER BY sprint_inner.created DESC
        LIMIT 1
    )
    OR sprint.id IS NULL
ORDER BY sprint.time ASC
The Django QuerySet documentation states:
It is permissible to specify a multi-valued field to order the results
by (for example, a ManyToManyField field). Normally this won’t be a
sensible thing to do and it’s really an advanced usage feature.
However, if you know that your queryset’s filtering or available data
implies that there will only be one ordering piece of data for each of
the main items you are selecting, the ordering may well be exactly
what you want to do. Use ordering on multi-valued fields with care and
make sure the results are what you expect.
I guess I need to apply some filter here, but I'm not sure what exactly Django expects...
One note, because it is not obvious from this example: the Runner table will have several hundred entries; the sprints will also number in the several hundreds and, later on, probably in the thousands. The data will be displayed in a paginated view, so sorting in Python is not an option.
The only other possibility I see is writing the SQL myself, but I'd like to avoid that at all costs.
I don't think there's a way to do this via the ORM with only one query; you could grab a list of runners and use annotate to add their latest sprint ids, then filter and order those sprints.
>>> from django.db.models import Max
# all runners now have a `last_race` attribute,
# which is the `id` of the last sprint they ran
>>> runners = Runner.objects.annotate(last_race=Max("sprint__id"))
# a list of each runner's last sprint ordered by the sprint's time,
# we use `select_related` to limit lookup queries later on
>>> results = Sprint.objects.filter(id__in=[runner.last_race for runner in runners])
... .order_by("time")
... .select_related("runner")
# grab the first result
>>> first_result = results[0]
# you can access the runner's details via `.runner`, e.g. `first_result.runner.name`
>>> isinstance(first_result.runner, Runner)
True
# this should only ever execute 2 queries, no matter what you do with the results
>>> from django.db import connection
>>> len(connection.queries)
2
This is pretty fast and will still utilize the database's indices and caching.
A few thousand records isn't all that much, this should work pretty well for those kinds of numbers. If you start running into problems, I suggest you bite the bullet and use raw SQL.
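If it ever comes to that, here is a hedged sketch of feeding the SQL from the question into the ORM via raw(); the app_runner/app_sprint table names are assumptions based on Django's default naming, so use Runner._meta.db_table and Sprint._meta.db_table in real code:

runners = Runner.objects.raw("""
    SELECT runner.id, runner.name, sprint.time
    FROM app_runner runner
    LEFT JOIN app_sprint sprint ON (sprint.runner_id = runner.id)
    WHERE sprint.id = (
        SELECT sprint_inner.id
        FROM app_sprint sprint_inner
        WHERE sprint_inner.runner_id = runner.id
        ORDER BY sprint_inner.created DESC
        LIMIT 1
    )
    OR sprint.id IS NULL
    ORDER BY sprint.time ASC
""")
for r in runners:
    print(r.name, r.time)  # time arrives as an extra attribute on each Runner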
def view_name(request):
    runner_ids = Sprint.objects.values_list('runner', flat=True).order_by('-created').distinct()
    runners = []
    for runner_id in runner_ids:
        latest_sprint = Sprint.objects.filter(runner=runner_id).order_by('-created')[:1]
        for latest in latest_sprint:
            runners.append({'runner': latest.runner, 'time': latest.time})
    return render(request, 'page.html', {
        'runners': runners,
    })
{% for runner in runners %}
{{runner.runner}} - {{runner.time}}
{% endfor %}
I have a report model looking a bit like this:
class Report(models.Model):
    date = models.DateField()
    quantity = models.IntegerField()
    product_name = models.TextField()
I know I can get the last entry for the last year for one product this way:
Report.objects.filter(date__year=2009, product_name="corn").order_by("-date")[0]
I know I can group entries by name this way:
Report.objects.values("product_name")
But how can I get the quantity of the last entry for each product? I feel like I would do it this way in SQL (not sure, my SQL is rusty):
SELECT product_name, quantity FROM report WHERE YEAR(date) = 2009 GROUP BY product_name HAVING date = MAX(date)
My guess is to use the Max() object with annotate, but I have no idea how to.
For now, I do it by manually taking the last item of a separate query for each product_name I can list with a distinct.
Not exactly a trivial query using either the Django ORM or SQL. My first take on it would be pretty much what you are probably already doing: get the distinct product and date pairs, then perform individual queries for each of those.
from django.db.models import Max

year_reports = Report.objects.filter(date__year=2009)
product_date_pairs = (year_reports
                      .values('product_name')
                      .annotate(Max('date')))
[Report.objects.get(product_name=p['product_name'], date=p['date__max'])
 for p in product_date_pairs]
But you can take it a step further with the Q operator and some fancy OR'ing to trim your query count down to 2 instead of N + 1.
import operator
from functools import reduce  # needed on Python 3, where reduce is not a builtin

from django.db.models import Q

qs = [Q(product_name=p['product_name'], date=p['date__max']) for p in product_date_pairs]
ored_qs = reduce(operator.or_, qs)
Report.objects.filter(ored_qs)
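A short usage sketch, just to show where the quantity asked about in the question ends up; this is two queries total, one for the pairs and one for the OR'ed filter:

latest_reports = Report.objects.filter(ored_qs)
for report in latest_reports:
    print(report.product_name, report.quantity)  # last 2009 entry per product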