Django: queryset.last() not returning right order / last record

I'm writing code that generates JSON. Every section of the JSON except the last has to be terminated by a ",", so in the code I have:
-- Define a queryset to retrieve distinct values of the database field:
databases_in_workload = DatabaseObjectsWorkload.objects.filter(workload=migration.workload_id).values_list('database_object__database', flat=True).distinct()
-- Then I cycle over it:
for database_wk in databases_in_workload:
    ... do something
    if not (database_wk == databases_in_workload.last()):
        job_json_string = job_json_string + '} ],'
    else:
        job_json_string = job_json_string + '} ]'
I want the last record to be terminated by a square bracket, the preceding by a comma. But instead, the opposite is happening.
I also looked at the database table content. The values I have for "database_wk" are user02 (for the records with a lower value of primary key) and user01 (for the records with the higher value of pk in the DB). The order (if user01 is first or last) really doesn't matter, as long as the last record is correctly identified by last() - so if I have user02, user01 in the query set iterations, I expect last() to return user01. However - this is not working correctly.
What is strange is that if in the database (Postgres) order is changed (first have user01, then user02 ordered by primary key values) then the "if" code above works, but in my situation last() seems to be returning the first record, not the last. It's as if there is one order in the database, another in the query set, and last() is taking the database order... Anybody encountered/solved this issue before? Alternatively - any other method for identifying the last record in a query set (other than last()) which I could try would also help. Many thanks in advance!

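It behaves this way because no ordering is specified. Iterating an unordered queryset returns rows in whatever order the database chooses, whereas last() applies its own ordering (see the quotes below), so the two need not agree. Try using order_by().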
From: queryset.first()
If the QuerySet has no ordering defined, then the queryset is automatically ordered by the primary key
From: queryset.last()
Works like first(), but returns the last object in the queryset.
If you don't want to use order_by(), try queryset.latest(), which orders by the field(s) you pass it (or by get_latest_by from the model's Meta).
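Another way to sidestep last() is to evaluate the queryset once and compare positions instead of values. The following is a minimal sketch, assuming a plain list stands in for the evaluated queryset; close_sections is a hypothetical helper, not part of Django:

```python
def close_sections(names):
    """Return one '} ]' terminator per name, with ',' on all but the final one."""
    names = list(names)  # evaluate once so the order cannot change between passes
    closers = []
    for index, _name in enumerate(names):
        is_last = index == len(names) - 1
        closers.append("} ]" if is_last else "} ],")
    return closers


print(close_sections(["user02", "user01"]))  # ['} ],', '} ]']
```

In the loop from the question this would mean calling list(databases_in_workload) once before iterating, so that the "last" item is by definition the last one the loop sees.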

Django inner join with search terms for tables across many-to-many relationships

I have a dynamic hierarchical advanced search interface I created. Basically, it allows you to search terms, from about 6 or 7 tables that are all linked together, that can be and'ed or or'ed together in any combination. The search entries in the form are all compiled into a complex Q expression.
I discovered a problem today. If I provide a search term for a field in a many-to-many related sub-table, the output table can include results from that table that don't match the term.
My problem can be reproduced in the shell with a simple query:
qs = PeakGroup.objects.filter(msrun__sample__animal__studies__id__exact=3)
sqs = qs[0].msrun.sample.animal.studies.all()
sqs.count()
#result: 2
Specifically:
In [3]: s = PeakGroup.objects.filter(msrun__sample__animal__studies__id__exact=3)
In [12]: ss = s[0].msrun.sample.animal.studies.all()
In [13]: ss[0].__dict__
Out[13]:
{'_state': <django.db.models.base.ModelState at 0x7fc12bfbf940>,
'id': 3,
'name': 'Small OBOB',
'description': ''}
In [14]: ss[1].__dict__
Out[14]:
{'_state': <django.db.models.base.ModelState at 0x7fc12bea81f0>,
'id': 1,
'name': 'obob_fasted',
'description': ''}
The ids in sqs queryset include 1 & 3 even though I only searched on 3. I don't get back literally all studies, so it is filtering some un-matching study records. I understand why I see that, but I don't know how to execute a query that treats it like a join I could perform in SQL where I can restrict the results to only include records that match the query, instead of getting back only records in the root model and gathering everything left-joined to those root model records.
Is there a way to do such an inner join (as the result of a single complex Q expression in a filter) on the entire set of linked tables so that I only get back records that match the M:M field search term?
UPDATE:
By looking at the SQL:
In [3]: str(s.query)
Out[3]: 'SELECT "DataRepo_peakgroup"."id", "DataRepo_peakgroup"."name", "DataRepo_peakgroup"."formula", "DataRepo_peakgroup"."msrun_id", "DataRepo_peakgroup"."peak_group_set_id" FROM "DataRepo_peakgroup" INNER JOIN "DataRepo_msrun" ON ("DataRepo_peakgroup"."msrun_id" = "DataRepo_msrun"."id") INNER JOIN "DataRepo_sample" ON ("DataRepo_msrun"."sample_id" = "DataRepo_sample"."id") INNER JOIN "DataRepo_animal" ON ("DataRepo_sample"."animal_id" = "DataRepo_animal"."id") INNER JOIN "DataRepo_animal_studies" ON ("DataRepo_animal"."id" = "DataRepo_animal_studies"."animal_id") WHERE "DataRepo_animal_studies"."study_id" = 3 ORDER BY "DataRepo_peakgroup"."name" ASC'
...I can see that the query is as specific as I would like it to be, but in the template, how do I specify that I want what I would have seen in the SQL result, had I supplied all of the specific related table fields I wanted to see in the output? E.g.:
SELECT "DataRepo_peakgroup"."id", "DataRepo_peakgroup"."name", "DataRepo_peakgroup"."formula", "DataRepo_peakgroup"."msrun_id", "DataRepo_peakgroup"."peak_group_set_id", "DataRepo_animal_studies"."study_id" FROM "DataRepo_peakgroup" INNER JOIN "DataRepo_msrun" ON ("DataRepo_peakgroup"."msrun_id" = "DataRepo_msrun"."id") INNER JOIN "DataRepo_sample" ON ("DataRepo_msrun"."sample_id" = "DataRepo_sample"."id") INNER JOIN "DataRepo_animal" ON ("DataRepo_sample"."animal_id" = "DataRepo_animal"."id") INNER JOIN "DataRepo_animal_studies" ON ("DataRepo_animal"."id" = "DataRepo_animal_studies"."animal_id") WHERE "DataRepo_animal_studies"."study_id" = 3 ORDER BY "DataRepo_peakgroup"."name" ASC
All the Django ORM gives you back from a query whose filters use fields from many-to-many (henceforth "M:M") related tables is a set of records from the "root" table from which the query started. It uses the same join logic you would use in an SQL query, following the foreign keys, so every root table record you get back is guaranteed to link to at least one M:M related record that matches your search term. But when you send the root table records (the queryset) to a template and use a nested loop to access records from the M:M related table, that loop always gets everything linked to the root table record, whether it matches your search term or not. So if a root table record links to multiple records in an M:M related table, at least one of them is guaranteed to match your search term, but the others may not.
In order to get the inner join (i.e. only including combined records that match search terms in the table(s) queried), you simply have to roll-your-own, because Django doesn't support it. I accomplished this in the following way:
When rendering search results, wherever you want to include records from a M:M related table, you have to create a nested loop. At the inner-most loop, I essentially re-enforce that the record matches all of the search terms. I.e. I implement my own filter using conditionals on each combination of records (from the root table and from each of the M:M related table). If any record combination does not match, I skip it.
Regarding the particulars as to how I did that, I won't get too into the weeds, but at each nested loop, I maintain a dict of its key path (e.g. msrun__sample__animal__studies) as the key and the current record as the value. I then use a custom simple_tag to re-run the search terms against the current record from each table (by checking the key path of the search term against those available in the dict).
Alternatively, you could do this in the view by compiling all of the matching records and sending the combined large table to the template - but I opted not to do this because of the hurdles surrounding cached_properties and since I already pass the search term form data to re-display the executed search form, the structure of the search was already available.
Watch out #1 - Even without "refiltering" the combined records in the template, note that the record count can be inaccurate when dealing with a combined/composite/joined table. To ensure that the number of records I report in my header above the table was always correct/accurate (i.e. represented the number of rows in my html table), I kept track of the filtered count and used javascript under the table to update the record count reported at the top.
Watch out #2 - There is one other thing to keep in mind when querying using terms from M:M related tables. For every M:M related table record that matches the search term, you will get back a duplicate record from the root table, and there's no way to tell the difference between them (i.e. which M:M record/field value matched in each case). For example, if you matched 2 records from an M:M related table, you would get back 2 identical root table records, and when you pass that data to the template, you would end up with 4 rows in your results, each with the same data from the root table record. To avoid this, all you have to do is append .distinct() to your results queryset.
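The re-filtering in the innermost loop can be pictured with plain Python. This is only a sketch of the idea, not the template code described above: dicts stand in for model instances, the data is made up, and matching_rows is a hypothetical helper.

```python
# Plain dicts stand in for root records and their M:M related records.
peak_groups = [
    {
        "name": "glucose",
        "studies": [
            {"id": 3, "name": "Small OBOB"},
            {"id": 1, "name": "obob_fasted"},
        ],
    },
    {"name": "lactate", "studies": [{"id": 3, "name": "Small OBOB"}]},
]


def matching_rows(records, study_id):
    """Yield (root name, study name) pairs, skipping studies that do not match."""
    for record in records:
        for study in record["studies"]:
            if study["id"] != study_id:
                continue  # linked to the root record, but not what was searched for
            yield record["name"], study["name"]


rows = list(matching_rows(peak_groups, study_id=3))
print(rows)  # [('glucose', 'Small OBOB'), ('lactate', 'Small OBOB')]
```

The count of rows that survive this loop is the number that should be reported above the table, which is why the count from the original queryset can disagree with what is displayed.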

Understanding Django JSONField key-path queries and exhaustive sets

While looking at the Django docs on querying JSONField, I came upon a note stating:
Due to the way in which key-path queries work, exclude() and filter() are not guaranteed to produce exhaustive sets. If you want to include objects that do not have the path, add the isnull lookup.
Can someone give me an example of a query that would not produce an exhaustive set? I'm having a pretty hard time coming up with one.
This is the ticket that resulted in the documentation that you quoted: https://code.djangoproject.com/ticket/31894
TL;DR: To get the inverse of .filter() on a JSON key path, it is not sufficient to only use .exclude() with the same clause since it will only give you records where the JSON key path is present but has a different value and not records where the JSON key path is not present at all. That's why it says:
If you want to include objects that do not have the path, add the isnull lookup.
If I may quote the ticket description here:
Filtering based on a JSONField key-value pair seems to have some
unexpected behavior when involving a key that not all records have.
Strangely, filtering on an optional property key will not return the
inverse result set that an exclude on the same property key will
return.
In my database, I have:
2250 total records
49 records where jsonfieldname = {'propertykey': 'PropertyValue'}
296 records where jsonfieldname has a 'propertykey' key with some other value
1905 records where jsonfieldname does not have a 'propertykey' key at all
The following code:
q = Q(jsonfieldname__propertykey="PropertyValue")
total_records = Record.objects.count()
filtered_records = Record.objects.filter(q).count()
excluded_records = Record.objects.exclude(q).count()
filtered_plus_excluded_records = filtered_records + excluded_records
print('Total: %d' % total_records)
print('Filtered: %d' % filtered_records)
print('Excluded: %d' % excluded_records)
print('Filtered Plus Excluded: %d' % filtered_plus_excluded_records)
Will output this:
Total: 2250
Filtered: 49
Excluded: 296
Filtered Plus Excluded: 345
It is surprising that the filtered+excluded value is not equal to the total record count. It's surprising that the union of an expression plus its inverse does not equal the sum of all records. I am not aware of any other queries in Django that would return a result like this. I realize adding a check that the key exists would return a more expected result, but that doesn't stop the above from being surprising.
I'm not sure what a solution would be - either a note in the documentation that this behavior should be expected, or take a look at how this same expression is applied for both the exclude() and filter() queries and see why they are not opposites.
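One way to picture why the two counts do not add up is to assume that a comparison against a missing key yields "unknown" (much like NULL in SQL), and that negating "unknown" is still "unknown". The sketch below is a pure-Python model of that assumption with made-up data; it is not Django's implementation, and key_equals is a hypothetical helper.

```python
records = (
    [{"propertykey": "PropertyValue"}] * 2  # key present, value matches
    + [{"propertykey": "Other"}] * 3        # key present, different value
    + [{}] * 4                              # key missing entirely
)


def key_equals(data, key, value):
    """Return True/False when the key exists, None (unknown) when it is missing."""
    if key not in data:
        return None
    return data[key] == value


results = [key_equals(r, "propertykey", "PropertyValue") for r in records]
filtered = sum(1 for r in results if r is True)   # rows filter() would keep
excluded = sum(1 for r in results if r is False)  # rows exclude() would keep
missing = sum(1 for r in results if r is None)    # rows neither call returns

print(len(records), filtered, excluded, missing)  # 9 2 3 4
```

In this model the rows with a missing key fall into neither set, which matches the shape of the numbers in the ticket (49 + 296 out of 2250). The isnull lookup mentioned in the documentation note is what brings those rows back in.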

Why is my chained Django queryset not working?

latest_entries = Entry.objects.filter(
    zipcode=request.user.my_profile.nearbyzips1
).filter(
    zipcode=request.user.my_profile.nearbyzips2
).filter(
    zipcode=request.user.my_profile.nearbyzips3
)
This does not seem to return any Entry objects, even though it should.
Note: If I remove all the chaining and just leave the initial nearbyzips1 filter, it returns all Entry objects that match that zipcode. So this tells me that my chaining is breaking something.
What am I doing incorrectly?
I am not using any m2m or foreign keys.
I guess you need to find all entries containing one of the given zipcodes. The correct approach is:
Entry.objects.filter(zipcode__in=[
    request.user.my_profile.nearbyzips1,
    request.user.my_profile.nearbyzips2,
    request.user.my_profile.nearbyzips3,
])
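This query returns all entries whose zipcode is one of the values in the list. The query you gave tries to find entries whose zipcode equals all three given zipcodes at the same time, because chained filter() calls on a plain field are ANDed together. Unless the three values are identical, no row can satisfy that, so it's normal for it to return nothing.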
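The difference between the two queries can be mimicked with list comprehensions. This is a sketch with made-up data in which dicts stand in for Entry rows:

```python
entries = [
    {"title": "a", "zipcode": "10001"},
    {"title": "b", "zipcode": "10002"},
    {"title": "c", "zipcode": "99999"},
]
nearby = ["10001", "10002", "10003"]

# Chained filters: every condition must hold for the same row.
chained = [e for e in entries if all(e["zipcode"] == z for z in nearby)]

# zipcode__in: matching any one of the values is enough.
any_of = [e for e in entries if e["zipcode"] in nearby]

print(len(chained), [e["title"] for e in any_of])  # 0 ['a', 'b']
```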

Remove duplicates in Django ORM -- multiple rows

I have a model that has four fields. How do I remove duplicate objects from my database?
Daniel Roseman's answer to this question seems appropriate, but I'm not sure how to extend it to a situation where there are four fields to compare per object.
Thanks,
W.
from django.db import models


def remove_duplicated_records(model, fields):
    """
    Removes records from `model` duplicated on `fields`
    while leaving the most recent one (biggest `id`).
    """
    duplicates = model.objects.values(*fields)
    # override any model specific ordering (for `.annotate()`)
    duplicates = duplicates.order_by()
    # group by same values of `fields`; count how many rows are the same
    duplicates = duplicates.annotate(
        max_id=models.Max("id"), count_id=models.Count("id")
    )
    # keep only the groups which are actually duplicated
    duplicates = duplicates.filter(count_id__gt=1)
    for duplicate in duplicates:
        to_delete = model.objects.filter(**{x: duplicate[x] for x in fields})
        # spare the latest duplicated record
        # you can use `Min` if you wish to spare the first record instead
        to_delete = to_delete.exclude(id=duplicate["max_id"])
        to_delete.delete()
You shouldn't do it often. Use unique_together constraints on database instead.
This leaves the record with the biggest id in the DB. If you want to keep the original record (first one), modify the code a bit with models.Min. You can also use a completely different field, like a creation date.
Underlying SQL
When annotating, the Django ORM uses a GROUP BY clause on all model fields used in the query, hence the use of the .values() method. GROUP BY groups all records that have identical values in those fields. The duplicated groups (more than one id for the same values of `fields`) are then selected by the HAVING clause that .filter() generates on the annotated QuerySet.
SELECT
    field_1,
    …
    field_n,
    MAX(id) AS max_id,
    COUNT(id) AS count_id
FROM
    app_mymodel
GROUP BY
    field_1,
    …
    field_n
HAVING
    COUNT(id) > 1
The duplicated records are then deleted in the for loop, except for the most recent one (the highest id) in each group.
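The grouping logic (GROUP BY, then a count above 1, then sparing the highest id) can be sketched without a database. The example below uses made-up rows as plain dicts; ids_to_delete is a hypothetical helper that only computes which ids would go, and it is not a replacement for the ORM code above.

```python
from collections import defaultdict


def ids_to_delete(rows, fields):
    """Return ids of duplicated rows, keeping the highest id in each group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in fields)  # GROUP BY the chosen fields
        groups[key].append(row["id"])
    doomed = []
    for ids in groups.values():
        if len(ids) > 1:      # HAVING COUNT(id) > 1
            keep = max(ids)   # MAX(id) is the record that survives
            doomed.extend(i for i in ids if i != keep)
    return sorted(doomed)


rows = [
    {"id": 1, "sku": "A", "review": "ok"},
    {"id": 2, "sku": "A", "review": "ok"},
    {"id": 3, "sku": "B", "review": "bad"},
    {"id": 4, "sku": "A", "review": "ok"},
]
print(ids_to_delete(rows, ["sku", "review"]))  # [1, 2]
```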
Empty .order_by()
Just to be sure, it's always wise to add an empty .order_by() call before aggregating a QuerySet.
The fields used for ordering the QuerySet are also included in GROUP BY statement. Empty .order_by() overrides columns declared in model's Meta and in result they're not included in the SQL query (e.g. default sorting by date can ruin the results).
You might not need to override it at the current moment, but someone might add default ordering later and therefore ruin your precious delete-duplicates code not even knowing that. Yes, I'm sure you have 100% test coverage…
Just add empty .order_by() to be safe. ;-)
https://docs.djangoproject.com/en/3.2/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
Transaction
Of course you should consider doing it all in a single transaction.
https://docs.djangoproject.com/en/3.2/topics/db/transactions/#django.db.transaction.atomic
If you want to delete duplicates on single or multiple columns, you don't need to iterate over millions of records.
Fetch all unique columns (don't forget to include the primary key column)
fetch = Model.objects.all().values("id", "skuid", "review", "date_time")
Read the result using pandas (I used pandas here instead of an ORM query)
import pandas as pd
df = pd.DataFrame.from_dict(fetch)
Drop duplicates on unique columns
uniq_df = df.drop_duplicates(subset=["skuid", "review", "date_time"])
# Do not include the primary key in subset, or no rows will count as duplicates
Now, you'll get the unique records from where you can pick the primary key
primary_keys = uniq_df["id"].tolist()
Finally, it's show time (exclude those id's from records and delete rest of the data)
records = Model.objects.all().exclude(pk__in=primary_keys).delete()

Django: Equivalent of "select [column name] from [tablename]"

I wanted to know is there anything equivalent to:
select columnname from tablename
Like Django tutorial says:
Entry.objects.filter(condition)
fetches all the objects with the given condition. It is like:
select * from Entry where condition
But I want to make a list of only one column [which in my case is a foreign key]. Found that:
Entry.objects.values_list('column_name', flat=True).filter(condition)
does the same. But in my case the column is a foreign key, and this query loses the property of a foreign key. It's just storing the values. I am not able to make the look-up calls.
Of course, values and values_list will retrieve the raw values from the database. Django can't work its "magic" on a model which means you don't get to traverse relationships because you're stuck with the id the foreign key is pointing towards, rather than the ForeignKey field.
If you need to filter those values, you could do the following (assuming column_name is a ForeignKey pointing to MyModel):
ids = Entry.objects.values_list('column_name', flat=True).filter(...)
my_models = MyModel.objects.filter(pk__in=set(ids))
Here's the documentation for values_list().
To restrict a query set to a specific column(s) you use .values(columname)
You should also probably add distinct to the end, so your query will end being:
Entry.objects.filter(myfilter).values(columname).distinct()
See: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.values
for more information
Depending on your answer in the comment, I'll come back and edit.
Edit:
I'm not certain if the approach is right one though. You can get all of your objects in a python list by getting a normal queryset via filter and then doing:
myobjectlist = [obj.mycolumnname for obj in myqueryset]
The only problem with that approach is if your queryset is large your memory use is going to be equally large.
Anyway, I'm still not certain on some of the specifics of the problem.
You have a model A with a foreign key to another model B, and you want to select the Bs which are referred to by some A. Is that right? If so, the query you want is just:
B.objects.filter(a__isnull=False)
If you have conditions on the corresponding A, then the query can be:
B.objects.filter(a__field1=value1, a__field2=value2, ...)
See Django's backwards relation documentation for an explanation of why this works, and the ForeignKey.related_name option if you want to change the name of the backwards relation.