Understanding Django JSONField key-path queries and exhaustive sets - django

While looking at the Django docs on querying JSONField, I came upon a note stating:
Due to the way in which key-path queries work, exclude() and filter() are not guaranteed to produce exhaustive sets. If you want to include objects that do not have the path, add the isnull lookup.
Can someone give me an example of a query that would not produce an exhaustive set? I'm having a pretty hard time coming up with one.

This is the ticket that resulted in the documentation that you quoted: https://code.djangoproject.com/ticket/31894
TL;DR: To get the inverse of .filter() on a JSON key path, it is not sufficient to only use .exclude() with the same clause since it will only give you records where the JSON key path is present but has a different value and not records where the JSON key path is not present at all. That's why it says:
If you want to include objects that do not have the path, add the isnull lookup.
If I may quote the ticket description here:
Filtering based on a JSONField key-value pair seems to have some
unexpected behavior when involving a key that not all records have.
Strangely, filtering on an optional property key will not return the
inverse result set that an exclude on the same property key will
return.
In my database, I have:
2250 total records
49 records where jsonfieldname = {'propertykey': 'PropertyValue'}
296 records where jsonfieldname has a 'propertykey' key with some other value
1905 records where jsonfieldname does not have a 'propertykey' key at all
The following code:
q = Q(jsonfieldname__propertykey="PropertyValue")
total_records = Record.objects.count()
filtered_records = Record.objects.filter(q).count()
excluded_records = Record.objects.exclude(q).count()
filtered_plus_excluded_records = filtered_records + excluded_records
print('Total: %d' % total_records)
print('Filtered: %d' % filtered_records)
print('Excluded: %d' % excluded_records)
print('Filtered Plus Excluded: %d' % filtered_plus_excluded_records)
Will output this:
Total: 2250
Filtered: 49
Excluded: 296
Filtered Plus Excluded: 345
It is surprising that the filtered+excluded value is not equal to the total record count. It's surprising that the union of an expression plus its inverse does not equal the sum of all records. I am not aware of any other queries in Django that would return a result like this. I realize adding a check that the key exists would return a more expected result, but that doesn't stop the above from being surprising.
I'm not sure what a solution would be - either a note in the documentation that this behavior should be expected, or take a look at how this same expression is applied for both the exclude() and filter() queries and see why they are not opposites.
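The gap is easier to see in raw SQL. Here is a minimal sketch using Python's built-in sqlite3 module rather than Django. The table name record, the column data, and the three sample rows are made up for illustration, the SQL only approximates what the ORM generates, and it assumes the SQLite build includes the JSON functions. A comparison against a missing key evaluates to NULL, and NOT NULL is still NULL, so those rows satisfy neither the condition nor its negation.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER PRIMARY KEY, data TEXT NOT NULL)")

# Hypothetical sample data: one match, one other value, one row without the key.
samples = [
    {"propertykey": "PropertyValue"},
    {"propertykey": "Other"},
    {"unrelated": 1},
]
conn.executemany(
    "INSERT INTO record (data) VALUES (?)",
    [(json.dumps(sample),) for sample in samples],
)


def count(where="1 = 1"):
    """Count the rows that satisfy the given WHERE clause."""
    return conn.execute(f"SELECT COUNT(*) FROM record WHERE {where}").fetchone()[0]


match = "json_extract(data, '$.propertykey') = 'PropertyValue'"

total = count()
filtered = count(match)                # rows where the comparison is true
excluded = count(f"NOT ({match})")     # rows where the comparison is false
missing = count("json_extract(data, '$.propertykey') IS NULL")  # key absent

print("Total:", total)
print("Filtered:", filtered)
print("Excluded:", excluded)
print("Missing key:", missing)
```

In ORM terms, the docs' advice amounts to OR-ing the negated condition with an isnull lookup on the same key path, for example Record.objects.filter(~q | Q(jsonfieldname__propertykey__isnull=True)), assuming the Record model and q from the ticket.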

Related

Django filter() returns more rows than the queryset it was applied to

print("Step 1", invs.count())  # -> 1000; invs is a QuerySet
invs2 = invs.filter(field__fields2__fields3=i)  # i is an int
print("Step 2", invs2.count())  # -> 40000
Is it normal for filter() to return more rows than the queryset it was called on?
Thank you.
Yes, there's an entire section in the docs that explains it:
Lookups that span relationships
Inside the big green note block further down below the "Spanning multi-valued relationships" heading it states
However, unlike the behavior when using filter(), this will not limit blogs based on entries that satisfy both conditions. In order to do that, i.e. to select all blogs that do not contain entries published with “Lennon” that were published in 2008, you need to make two queries:
The relevant information can be found in the description of distinct().
To quote:
Returns a new QuerySet that uses SELECT DISTINCT in its SQL query. This eliminates duplicate rows from the query results.
By default, a QuerySet will not eliminate duplicate rows. In practice, this is rarely a problem, because simple queries such as Blog.objects.all() don’t introduce the possibility of duplicate result rows. However, if your query spans multiple tables, it’s possible to get duplicate results when a QuerySet is evaluated. That’s when you’d use distinct().
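A minimal sketch of why a filter that spans a relationship can return more rows than it started with, using the built-in sqlite3 module instead of Django. The blog and entry tables and their contents are hypothetical. The join produces one result row per matching related row, so a single blog shows up three times until DISTINCT collapses the duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE blog (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE entry (
        id INTEGER PRIMARY KEY,
        blog_id INTEGER REFERENCES blog(id),
        year INTEGER
    );
    INSERT INTO blog (id, name) VALUES (1, 'Beatles Blog');
    INSERT INTO entry (blog_id, year) VALUES (1, 2008), (1, 2008), (1, 2008);
    """
)

# Filtering blogs on a column of the related table requires a join.
join = "FROM blog JOIN entry ON entry.blog_id = blog.id WHERE entry.year = 2008"

plain = conn.execute(f"SELECT blog.id {join}").fetchall()
distinct = conn.execute(f"SELECT DISTINCT blog.id {join}").fetchall()

print(len(plain), "rows without DISTINCT")
print(len(distinct), "rows with DISTINCT")
```

Calling distinct() on the filtered queryset corresponds to the second query here.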

Total number of documents in pysolr

How can I get the total number of documents matching a given query? I have used the query below:
result = solr.search('ad_id : 20')
print(len(result))
Since the default number of rows returned is 10, the output is only 10, but the actual count is 4000. How can I get the total count?
The results object from pysolr has a hits property that contains the total number of hits, regardless of how many documents are returned. This is named numFound in the raw response from Solr.
Your solution isn't really suitable for anything with a larger dataset, since it requires you to retrieve all the documents, even if you don't need them or want to show their content.
The count is stored in the numFound field of the raw response. Use the code below:
result = solr.search('ad_id : 20')
print(result.raw_response['response']['numFound'])
As @MatsLindh mentioned:
result = solr.search('ad_id : 20')
print(result.hits)
Finally got the answer:
Added rows=1000000 at the end of the query.
result = solr.search('ad_id : 20', rows=1000000)
But if there are more matching documents than this, the number has to be raised in the query. This might be a bad solution, but it works.
If anyone has a better solution please do reply.
If you just want the total number of items that satisfy your query, here is my Python3 code (using the pysolr module):
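import pysolr

SOLR_HOST = "localhost:8983"  # assumption: replace with your own Solr host and port
collection = 'bookindex'  # or whatever your collection is called
solr_url = f"http://{SOLR_HOST}/solr/{collection}"
solr = pysolr.Solr(url=solr_url, timeout=120, always_commit=True)
result = solr.search("*:*", rows=0)  # rows=0: fetch no documents, only the metadata
print(result.hits)  # total number of matching documents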
This queries for all documents ("*:*") -- 315913 in my case -- but you can narrow that to suit your requirements. For example, if I want to know how many of my book entries have title:pandas I can search("title:pandas", rows=0) and get 41 as the number that have pandas in the title. By setting rows=0 you're telling Solr that it need not format any results for you and only has to return the metadata, which is much more efficient than setting a high limit on rows.

django - queryset.last() not returning right order / last record

I'm writing code that generates JSON. Every section of the JSON except the last has to be terminated by a ",", so in the code I have:
-- Define a queryset to retrieve distinct values of the database field:
databases_in_workload = DatabaseObjectsWorkload.objects.filter(workload=migration.workload_id).values_list('database_object__database', flat=True).distinct()
-- Then I cycle over it:
for database_wk in databases_in_workload:
    ... do something
    if not (database_wk == databases_in_workload.last()):
        job_json_string = job_json_string + '} ],'
    else:
        job_json_string = job_json_string + '} ]'
I want the last record to be terminated by a square bracket, the preceding by a comma. But instead, the opposite is happening.
I also looked at the database table content. The values I have for "database_wk" are user02 (for the records with a lower value of primary key) and user01 (for the records with the higher value of pk in the DB). The order (if user01 is first or last) really doesn't matter, as long as the last record is correctly identified by last() - so if I have user02, user01 in the query set iterations, I expect last() to return user01. However - this is not working correctly.
What is strange is that if in the database (Postgres) order is changed (first have user01, then user02 ordered by primary key values) then the "if" code above works, but in my situation last() seems to be returning the first record, not the last. It's as if there is one order in the database, another in the query set, and last() is taking the database order... Anybody encountered/solved this issue before? Alternatively - any other method for identifying the last record in a query set (other than last()) which I could try would also help. Many thanks in advance!
The reason it behaves the way it does is that no ordering is specified. Try using order_by.
From: queryset.first()
If the QuerySet has no ordering defined, then the queryset is automatically ordered by the primary key
From: queryset.last()
Works like first(), but returns the last object in the queryset.
If you don't want to use order_by then try using queryset.latest()
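As for another way to identify the last record, one option that does not depend on last() at all is to evaluate the queryset once and compare positions. This is only a sketch: the databases list is a hypothetical stand-in for list(databases_in_workload), and the "database" key in the generated string is made up, since the question does not show the real JSON layout.

```python
# Hypothetical stand-in for list(databases_in_workload) from the question.
databases = ["user02", "user01"]

last_index = len(databases) - 1
job_json_string = ""

for index, database_wk in enumerate(databases):
    # Placeholder for the real per-database JSON content.
    job_json_string += '{ "database": "%s" ' % database_wk
    # Only the final item, in iteration order, gets the closing form without a comma.
    job_json_string += "} ]" if index == last_index else "} ],"

print(job_json_string)
```

Because the check uses the iteration order itself, it stays correct whatever order the database returns the rows in.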

Django object filter - price behaving strangely, eg 170 treated as 17 etc

I have a simple object filter that uses price__lt and price__gt. This works on a property on my product model called price, which is a CharField [string] (decimal saw the same errors, and caused trouble with aggregation so reverted to string).
It seems that when these values are passed to the filter, they are treated in a strange way, e.g. 10 is treated as 100. For example:
/products/price/10-200/ returns products priced 100-200. The filters are passed in as filter args: FILTER ARGS: {'price__lt': '200', 'price__gt': '10'}. This also breaks in the sense that price/0-170 will NOT return products priced at 18.50; it is treating the 170 as 'less than 18' for some reason.
Any idea what would cause this, and how to fix it? Thanks!
The problem, as Jeff suggests, is that price is a CharField and thus is being compared using character-by-character string comparison logic, i.e. any string of any length starting with 1 will be less than any string of any length starting with 2, etc.
I'm curious what problems you were having with having price be an IntegerField, as that would seem to be the straightforward solution, but if you need to keep price as a CharField, here's a (hacky) way to make the query work:
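lt = 200
gt = 10
# The cast is repeated in WHERE because most databases do not allow a SELECT alias there.
qs = Product.objects.extra(
    select={'int_price': 'CAST(price AS INTEGER)'},
    where=['CAST(price AS INTEGER) < %s', 'CAST(price AS INTEGER) > %s'],
    params=[lt, gt],
)
qs.all()  # the result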
This uses the extra method of Django's QuerySet class, which you can read about in the docs here. In a nutshell, it computes an integer version of the string price using SQL's cast expression and then filters with integers based on that.
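The character-by-character comparison described above can be reproduced in plain Python, which orders strings the same way. The price list below is made up, and the bounds are taken from the 0-170 example in the question. Compared as strings, "18.50" is not below "170" because "8" sorts after "7"; compared as numbers, it is.

```python
from decimal import Decimal

# Hypothetical prices stored as strings, as a CharField would hold them.
prices = ["18.50", "100", "165", "9"]
low, high = "0", "170"

# String comparison: decided by the first differing character.
string_matches = [p for p in prices if low < p < high]

# Numeric comparison: what the price filter was meant to do.
numeric_matches = [p for p in prices if Decimal(low) < Decimal(p) < Decimal(high)]

print("As strings:", string_matches)
print("As numbers:", numeric_matches)
```

This is why storing the price in a numeric field is the more robust fix.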

Django Object Filter (last 1000)

How would one go about retrieving the last 1,000 values from a database via objects.filter? The query I am currently using brings me the first 1,000 values entered into the database (i.e. with 10,000 rows it brings me rows 1-1,000 instead of 9,001-10,000).
Current Code:
limit = 1000
Shop.objects.filter(ID = someArray[ID])[:limit]
Cheers
Solution:
queryset = Shop.objects.filter(id=someArray[id])
limit = 1000
count = queryset.count()
endoflist = queryset.order_by('timestamp')[count-limit:]
endoflist is the queryset you want.
Efficiency:
The following is from the django docs about the reverse() queryset method.
To retrieve the "last" five items in a queryset, you could do this:
my_queryset.reverse()[:5]
Note that this is not quite the same as slicing from the end of a sequence in Python. The above example will return the last item first, then the penultimate item and so on. If we had a Python sequence and looked at seq[-5:], we would see the fifth-last item first. Django doesn't support that mode of access (slicing from the end), because it's not possible to do it efficiently in SQL.
So I'm not sure if my answer is merely inefficient, or extremely inefficient. I moved the order_by to the final query, but I'm not sure if this makes a difference.
reversed(Shop.objects.filter(id=someArray[id]).reverse()[:limit])
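The reverse-then-slice idea can be sketched at the SQL level with the built-in sqlite3 module. The shop table and its ts column are hypothetical, with ts standing in for the timestamp field assumed in the answer above. The database sorts newest first and returns only limit rows, and Python then flips that small result back into ascending order, so no count() query is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shop (id INTEGER PRIMARY KEY, ts INTEGER)")
conn.executemany("INSERT INTO shop (ts) VALUES (?)", [(t,) for t in range(1, 11)])

limit = 3

# Let the database pick the newest rows: descending order plus LIMIT.
newest_first = conn.execute(
    "SELECT ts FROM shop ORDER BY ts DESC LIMIT ?", (limit,)
).fetchall()

# Restore ascending order in Python; only `limit` rows are touched.
last_rows = [ts for (ts,) in reversed(newest_first)]

print(last_rows)
```

The same pattern in the ORM needs an explicit ordering, for example order_by('-timestamp')[:limit], because reverse() has nothing to reverse on a queryset with no ordering defined.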