Partial Keyword Search and Ranking - django

Using Django and Postgres, I have an investment holding model like so:
class Holding(BaseModel):
    name = models.CharField(max_length=255, db_index=True)
    symbol = models.CharField(max_length=16, db_index=True)
    fund_codes = ArrayField(models.CharField(max_length=16), blank=True, default=list)
    ...
The table contains approximately 70k US and Canadian equities and mutual funds. I want to build an autocomplete search function that ranks results in this order: 1) exact matches on the symbol or fund_codes, 2) near matches on the symbol, then 3) full-text matches on the holding name.
If I have a search vector that adds more weight to the symbol and fund_codes:
from django.contrib.postgres.search import SearchVector, SearchQuery, SearchRank
from django.db.models import F, Func, Value
vector = SearchVector('name', weight='D') + \
    SearchVector('symbol', weight='A') + \
    SearchVector(Func(F('fund_codes'), Value(' '), function='array_to_string'), weight='A')
Then, searching for 'MA':
Investment.objects \
    .annotate(document=vector, rank=SearchRank(vector, query)) \
    .filter(document__icontains='MA') \
    .order_by('-rank') \
    .values_list('name', 'fund_codes', 'symbol', 'rank',)
doesn't give the results I need. I need MA (Mastercard) as the top listing, then MAS (Masco Corp) and so on, followed by listings containing 'MA' in the name field.
I've also looked at overriding SearchQuery with:
class MySearchQuery(SearchQuery):
    def as_sql(self, compiler, connection):
        params = [self.value]
        if self.config:
            config_sql, config_params = compiler.compile(self.config)
            template = 'to_tsquery({}::regconfig, %s)'.format(config_sql)
            params = config_params + [self.value]
        else:
            template = 'to_tsquery(%s)'
        if self.invert:
            template = '!!({})'.format(template)
        return template, params
But I'm still not getting the results I need. Any suggestions on how I should approach search functionality in this use case? Perhaps concatenate an exact search query and a full-text search query?

What you need is to pass in a normalization parameter. With option 2, the rank is divided by the document length, so short documents that match (such as an exact symbol or fund code) rank above longer ones. The raw query looks like the following:
SELECT id, name, symbol, fund_codes,
       ts_rank_cd(to_tsvector(array_to_string(fund_codes, ' ')), to_tsquery('MA'), 2) AS rank
FROM Holding
ORDER BY rank DESC
LIMIT 100;
Notice the normalization parameter passed as the third argument; see https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING
How to do it in Django?
I believe that Django doesn't support passing normalization yet. I see an open ticket for it, but it's 2 years old. Maybe no one has worked on it yet.
https://code.djangoproject.com/ticket/28194
You can use a raw query for now. See the official documentation on how:
https://docs.djangoproject.com/en/2.2/topics/db/sql/
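The three-tier priority described in the question (exact symbol or fund code, then near match on the symbol, then a match in the name) can also be sketched outside the database. The following is a minimal plain-Python illustration, not a Django or Postgres implementation: the function names, the tuple shape, and the sample rows other than Mastercard and Masco are assumptions made for the example, and "near match" is assumed to mean "symbol starts with the term".

```python
def rank_holding(term, symbol, fund_codes, name):
    """Return a sort tier for one holding (lower is better), or None if no match."""
    term = term.upper()
    codes = [code.upper() for code in fund_codes]
    if symbol.upper() == term or term in codes:
        return 0  # exact match on the symbol or on one of the fund codes
    if symbol.upper().startswith(term):
        return 1  # near match: the symbol begins with the term (assumed meaning)
    if term in name.upper():
        return 2  # the term appears somewhere in the holding name
    return None


def autocomplete(term, holdings):
    """holdings: iterable of (symbol, fund_codes, name) tuples (hypothetical shape)."""
    scored = []
    for symbol, fund_codes, name in holdings:
        tier = rank_holding(term, symbol, fund_codes, name)
        if tier is not None:
            # Within a tier, shorter symbols sort first, then alphabetically.
            scored.append((tier, len(symbol), symbol, name))
    scored.sort()
    return [(symbol, name) for _, _, symbol, name in scored]


# Illustrative sample data only.
sample = [
    ("MAS", [], "Masco Corp"),
    ("TM", [], "Toyota Motor Corp"),
    ("MA", [], "Mastercard Inc"),
    ("WMT", [], "Walmart Inc"),
]
results = autocomplete("MA", sample)
```

In a database-backed version the same idea would likely become a computed tier column ordered before the full-text rank, which is close to the "concatenate an exact search query and a full-text search query" idea raised in the question.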

Related

Django query set filter reverse startswith on charfield

Imagine some kind of product rule which has 2 conditions:
names are equal
SKUs have a partial match (the SKU starts with a stored prefix).
The rule model looks like this:
class CreateAndAssignRule(models.Model):
    name_equals = models.CharField(max_length=100)
    sku_starts_with = models.CharField(max_length=100)
Now I want to fetch all of the rules with name Product 1 and match sku sku-b-292
class CreateAndAssignRuleQuerySet(QuerySet):
    def filter_by_name_and_sku(self, name, sku):
        # Find the rules whose name matches and whose sku_starts_with value is a prefix of the sku provided.
        rules = self.filter(name_equals=name)
        approved_ids = []
        for rule in rules:
            # Loop through the rules to find out which of them hold the beginning of the sku.
            # A sku_starts_with field contains a value such as 'sku-a', whereas the search string is the full sku 'sku-a-111'. We want to match 'sku-a-111' but not 'sku-b-222'.
            if sku.startswith(rule.sku_starts_with):
                approved_ids.append(rule.id)
        return self.filter(id__in=approved_ids)
Although the above works, it's hardly efficient, especially as the number of rules is starting to grow a lot.
How can I resolve this with a queryset? Filtering on __startswith doesn't do the trick, as I need the reverse.
Filter with:
from django.db.models import F, QuerySet, Value
class CreateAndAssignRuleQuerySet(QuerySet):
    def filter_by_name_and_sku(self, name, sku):
        return self.alias(
            sku=Value(sku)
        ).filter(
            name_equals=name,
            sku__startswith=F('sku_starts_with')
        )
We thus inject the sku into the queryset as an alias, and then apply a __startswith lookup to it against the sku_starts_with column.
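The reversed comparison boils down to a condition of roughly the form "the given SKU is LIKE the stored prefix followed by a wildcard". The sketch below shows that idea with the standard-library sqlite3 module rather than Django and Postgres, so the table name `rule` and the helper `matching_rule_ids` are assumptions for illustration only. Note that SQLite's LIKE is case-insensitive for ASCII by default, and that `%` or `_` inside a stored prefix would act as wildcards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rule ("
    "id INTEGER PRIMARY KEY, name_equals TEXT, sku_starts_with TEXT)"
)
conn.executemany(
    "INSERT INTO rule (name_equals, sku_starts_with) VALUES (?, ?)",
    [("Product 1", "sku-a"), ("Product 1", "sku-b"), ("Product 2", "sku-b")],
)


def matching_rule_ids(name, sku):
    """Return ids of rules whose name matches and whose prefix starts the sku."""
    rows = conn.execute(
        "SELECT id FROM rule "
        "WHERE name_equals = ? AND ? LIKE sku_starts_with || '%' "
        "ORDER BY id",
        (name, sku),
    )
    return [row[0] for row in rows]


matched = matching_rule_ids("Product 1", "sku-b-292")
```

Only the second rule matches here: it has the right name and its prefix 'sku-b' starts the SKU, all decided in a single query instead of a Python loop.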

django + postgres and select distinct

I am running django with postgres and I need to query some record from a table, sorting them by rank, and get unique entry in respect of a foreign key.
Basically my model is something like this:
class BookingCatalog(models.Model):
    .......
    boat = models.ForeignKey(Boat, verbose_name=u"Boat", related_name="booking_catalog")
    is_skippered = models.BooleanField(u'Is Skippered', choices=SKIPPER_CHOICE, default=False)
    rank = models.IntegerField(u"Rank", default=0, db_index=True)
    .......
The idea is to run something like this
BookingCatalog.objects.filter(...).order_by('-rank', 'boat', 'is_skippered').distinct('boat')
Unfortunately, this is not working since I am using postgres which raises this exception:
SELECT DISTINCT ON expressions must match initial ORDER BY expressions
What should I do instead?
The distinct argument has to match the first order argument. Try using this:
BookingCatalog.objects.filter(...) \
    .order_by('boat', '-rank', 'is_skippered') \
    .distinct('boat')
The way that I do this is to select the distinct objects first, then use those results to filter another queryset.
# Initial filtering
result = BookingCatalog.objects.filter(...)
# Make the results distinct
result = result.order_by('boat').distinct('boat')
# Extract the pks from the result
result_pks = result.values_list("pk", flat=True)
# Use those result pks to create a new queryset
result_2 = BookingCatalog.objects.filter(pk__in=result_pks)
# Order that queryset
result_2 = result_2.order_by('-rank', 'is_skippered')
print(result_2)
I believe that this results in a single query being executed, which contains a subquery. I would love for someone who knows more about Django to confirm this though.
Ordering by -rank after boat will give you the highest-ranked row of each duplicate, but your overall query results will be ordered by the boat field:
BookingCatalog.objects.filter(...).order_by('boat', '-rank', 'is_skippered').distinct('boat')
For more information, refer to the Django documentation for distinct(), including the notes specific to Postgres.
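The two-step idea in the answers above (first keep one row per boat, then order the survivors by rank) can be illustrated in plain Python. This is only a sketch of the logic, not of the SQL that Django generates; the sample rows are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

# Made-up rows standing in for BookingCatalog records.
rows = [
    {"pk": 1, "boat": "A", "rank": 5, "is_skippered": False},
    {"pk": 2, "boat": "A", "rank": 9, "is_skippered": True},
    {"pk": 3, "boat": "B", "rank": 7, "is_skippered": False},
    {"pk": 4, "boat": "C", "rank": 8, "is_skippered": False},
    {"pk": 5, "boat": "C", "rank": 2, "is_skippered": True},
]

# Step 1: emulate ORDER BY boat, rank DESC, is_skippered with DISTINCT ON (boat).
# Sorting by boat first groups duplicates; the first row of each group survives.
ordered = sorted(rows, key=lambda r: (r["boat"], -r["rank"], r["is_skippered"]))
best_per_boat = [
    next(group) for _, group in groupby(ordered, key=itemgetter("boat"))
]

# Step 2: order the surviving rows by rank, highest first.
final = sorted(best_per_boat, key=lambda r: (-r["rank"], r["is_skippered"]))
final_pks = [r["pk"] for r in final]
```

Each boat contributes its highest-ranked row, and the final list is ordered by rank rather than by boat, which is the result the question asks for.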

Django postgres HStoreField order by

Can I order the results of QuerySet by values inside of HStoreField, for example I've got model:
class Product(models.Model):
    name = CharField(max_length=100)
    properties = HStoreField()
And I want to store some properties of my product in HStoreField like:
{ 'discount': '10', 'color': 'white'}
In view I want to order the resulting QuerySet by discount.
The above answer does not work. Order transforms were never implemented for HStoreField, see https://code.djangoproject.com/ticket/24747.
But the suggestion in https://code.djangoproject.com/ticket/24592 works. Here is some more detail.
from django.contrib.postgres.fields import HStoreField
from django.db.models import F, Func, Model, TextField, Value
class MyThing(Model):
    name = TextField()
    keys = HStoreField()
# keys needs a dict before items can be assigned, and HStore values must be strings
things = [MyThing(name='foo', keys={}),
          MyThing(name='bar', keys={}),
          MyThing(name='baz', keys={})]
things[0].keys['movie'] = 'Jaws'
things[1].keys['movie'] = 'Psycho'
things[2].keys['movie'] = 'The Birds'
things[0].keys['rating'] = '5'
things[1].keys['rating'] = '4'
things[2].keys['year'] = '1963'
# Save the objects so that they can be queried
MyThing.objects.bulk_create(things)
# Informal search
MyThing.objects\
    .filter(keys__has_key='rating')\
    .order_by(Func(F('keys'), Value('movie'),
                   function='',
                   arg_joiner=' -> ',
                   output_field=TextField()))
The formal search is exactly as described in the second link above. Use the imports in the above snippet with that code.
This might work (I cannot test right now):
.order_by("properties -> 'discount'")
But be aware that HSTORE values are strings, so you do not get numeric order but instead items are ordered as strings.
To get proper numeric order, you should extract the properties->discount key as a separate column and cast it to integer.
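The difference between string order and numeric order is easy to see in plain Python. The discount values below are made up for illustration; the same effect applies to values stored as strings in an HStore column.

```python
# Discount values as they would come out of an HStore: all strings.
discounts = ["10", "9", "100", "25"]

# Sorting the raw strings compares character by character.
as_strings = sorted(discounts)

# Casting to int first gives the numeric order.
as_numbers = sorted(discounts, key=int)
```

Here `as_strings` is `['10', '100', '25', '9']` while `as_numbers` is `['9', '10', '25', '100']`, which is why the cast to integer matters.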

django paginate with non-model object

I'm working on a side project using Python and Django. It's a website that tracks the prices of some products from some website, then shows the historical prices of each product.
So, I have this class in Django:
class Product(models.Model):
    price = models.FloatField()
    date = models.DateTimeField(auto_now=True)
    name = models.CharField()
Then, in my views.py, because I want to display products in a table, like so:
+----------+--------+--------+--------+--------+....
| Name | Date 1 | Date 2 | Date 3 |... |....
+----------+--------+--------+--------+--------+....
| Product1 | 100.0 | 120.0 | 70.0 | ... |....
+----------+--------+--------+--------+--------+....
...
I'm using the following class for rendering:
class ProductView(object):
    name = ""
    price_history = {}
So that in my template, I can easily convert each product_view object into one table row. I'm also passing through context a sorted list of all available dates, for the purpose of constructing the head of the table, and getting the price of each product on that date.
Then I have logic in views that converts one or more products into this ProductView object. The logic looks something like this:
def conversion():
    result_dict = {}
    all_products = Product.objects.all()
    for product in all_products:
        if product.name in result_dict:
            result_dict[product.name].append(product)
        else:
            result_dict[product.name] = [product]
    # So result_dict will be like
    # {"Product1":[product, product], "Product2":[product],...}
    product_views = []
    for products in result_dict.values():
        # Logic that converts list of Product into ProductView, which is simple.
        ...
    # Then I'm returning the product_views, sorted based on the price on the
    # latest date, None if not available.
    return sorted(product_views,
                  key=lambda x: get_latest_price(latest_date, x),
                  reverse=True)
As per Daniel Roseman and zymud, adding get_latest_price:
def get_latest_price(date, product_view):
    if date in product_view.price_history:
        return product_view.price_history[date]
    else:
        return None
I omitted the logic to get the latest date in conversion. I have a separate table that only records each date I run my price-collecting script that adds new data to the table. So the logic of getting latest date is essentially get the date in OpenDate table with highest ID.
So, the question is: when the number of products grows large, how do I paginate that product_views list? E.g. if I want to see 10 products in my web application, how do I tell Django to get only those rows out of the DB?
I can't (or don't know how to) use django.core.paginator.Paginator, because to create the 10 rows I want, Django needs to select all rows related to those 10 product names. But to figure out which 10 names to select, it first needs to get all objects, then figure out which ones have the highest price on the latest date.
It seems to me the only solution would be to add something between Django and the DB, like a cache, to store the ProductView objects. But other than that, is there a way to directly paginate the product_views list?
I'm wondering if this makes sense:
The basic idea is, since I'll need to sort all product_views by the price on the "latest" date, I'll do that bit in DB first, and only get the list of product names to make it "paginatable". Then, I'll do a second DB query, to get all the products that have those product names, then construct that many product_views. Does it make sense?
To clear it a little bit, here comes the code:
So instead of
# def conversion():
    all_products = Product.objects.all()
I'm doing this:
# def conversion():
    # This would get me the latest available date
    latest_date = OpenDate.objects.order_by('-id')[:1]
    top_ten_priced_product_names = (
        Product.objects
        .filter(date__in=latest_date)
        .order_by('-price')
        .values_list('name', flat=True)[:10]
    )
    all_products_that_i_need = Product.objects.filter(
        name__in=top_ten_priced_product_names)
    # then I can construct that list of product_views using
    # all_products_that_i_need
Then for pages after the first, I can change that [:10] to, say, [10:20] or [20:30].
This makes the pagination code easier, and by pulling the appropriate code into a separate function, it's also possible to do Ajax and all that fancy stuff.
But here comes a problem: this solution needs three DB calls for every single query. Right now I'm running everything on the same box, but I still want to reduce this overhead to two (one for OpenDate, the other for Product).
Is there a better solution that solves the pagination problem with only two DB calls?
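Python slices, and Django queryset slices with them, take [start:stop] bounds rather than [offset:count], so the bounds for a given page have to be computed. A minimal sketch follows; the helper name `page_bounds` and the sample list are assumptions for illustration.

```python
def page_bounds(page, per_page=10):
    """Return (start, stop) slice bounds for a 1-based page number."""
    start = (page - 1) * per_page
    return start, start + per_page


# Stand-in for an ordered list of product names (35 of them).
names = [f"Product{i}" for i in range(1, 36)]

start, stop = page_bounds(2)
second_page = names[start:stop]   # names 11 to 20

start, stop = page_bounds(4)
last_page = names[start:stop]     # only 5 names are left on page 4
```

The same bounds could be applied to the values_list() queryset from the question, since slicing a queryset adds a limit and offset to the query instead of loading every row.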

Retrieving untagged objects with django-tagging

What I'm looking for is a QuerySet containing any objects not tagged.
The solution I've come up with so far looks overly complicated to me:
# Get all tags for model
tags = Location.tags.all().order_by('name')
# Get a list of tagged location id's
tag_list = tags.values_list('name', flat=True)
tag_names = ', '.join(tag_list)
tagged_locations = Location.tagged.with_any(tag_names) \
    .values_list('id', flat=True)
untagged_locations = []
for location in Location.objects.all():
    if location.id not in tagged_locations:
        untagged_locations.append(location)
Any ideas for improvement? Thanks!
There is some good information in this post, so I don't feel that it should be deleted, but there is a much, much simpler solution
I took a quick peek at the source code for django-tagging. It looks like they use the ContentType framework and generic relations to pull it off.
Because of this, you should be able to create a generic reverse relation on your Location class to get easy access to the TaggedItem objects for a given location, if you haven't already done so:
from django.contrib.contenttypes import generic
from tagging.models import TaggedItem
class Location(models.Model):
    ...
    tagged_items = generic.GenericRelation(TaggedItem,
                                           object_id_field="object_id",
                                           content_type_field="content_type")
    ...
Clarification
My original answer suggested to do this:
untagged_locs = Location.objects.filter(tagged_items__isnull=True)
Although this would work for a 'normal join', this actually doesn't work here because the content type framework throws an additional check on content_type_id into the SQL for isnull:
SELECT [snip] FROM `sotest_location`
LEFT OUTER JOIN `tagging_taggeditem`
ON (`sotest_location`.`id` = `tagging_taggeditem`.`object_id`)
WHERE (`tagging_taggeditem`.`id` IS NULL
AND `tagging_taggeditem`.`content_type_id` = 4 )
You can hack-around it by reversing it like this:
untagged_locs = Location.objects.exclude(tagged_items__isnull=False)
But that doesn't quite feel right.
I also proposed this, but it was pointed out that annotations don't work as expected with the content types framework.
from django.db.models import Count
untagged_locs = Location.objects.annotate(
num_tags=Count('tagged_items')).filter(num_tags=0)
The above code works for me in my limited test case, but it could be buggy if you have other 'taggable' objects in your model. The reason being that it doesn't check the content_type_id as outlined in the ticket. It generated the following SQL:
SELECT [snip], COUNT(`tagging_taggeditem`.`id`) AS `num_tags`
FROM `sotest_location`
LEFT OUTER JOIN `tagging_taggeditem`
ON (`sotest_location`.`id` = `tagging_taggeditem`.`object_id`)
GROUP BY `sotest_location`.`id` HAVING COUNT(`tagging_taggeditem`.`id`) = 0
ORDER BY NULL
If Location is your only taggable object, then the above would work.
Proposed Workaround
Short of getting the annotation mechanism to work, here's what I would do in the meantime:
untagged_locs_e = Location.objects.extra(
where=["""NOT EXISTS(SELECT 1 FROM tagging_taggeditem ti
INNER JOIN django_content_type ct ON ti.content_type_id = ct.id
WHERE ct.model = 'location'
AND ti.object_id = myapp_location.id)"""]
)
This adds an additional WHERE clause to the SQL:
SELECT [snip] FROM `myapp_location`
WHERE NOT EXISTS(SELECT 1 FROM tagging_taggeditem ti
INNER JOIN django_content_type ct ON ti.content_type_id = ct.id
WHERE ct.model = 'location'
AND ti.object_id = myapp_location.id)
It joins to the django_content_type table to ensure that you're looking at the appropriate content type for your model in the case where you have more than one taggable model type.
Change myapp_location.id to match your table name. There's probably a way to avoid hard-coding the table names, but you can figure that out if it's important to you.
Adjust accordingly if you're not using MySQL.
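To see why the content type join matters, the NOT EXISTS clause can be tried against a tiny in-memory database. The sketch below uses the standard-library sqlite3 module instead of MySQL; the table layouts are reduced to the columns the clause touches, and the sample rows (including the second content type 'event') are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE django_content_type (id INTEGER PRIMARY KEY, model TEXT);
CREATE TABLE myapp_location (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tagging_taggeditem (
    id INTEGER PRIMARY KEY, content_type_id INTEGER, object_id INTEGER);
INSERT INTO django_content_type (id, model) VALUES (4, 'location'), (5, 'event');
INSERT INTO myapp_location (id, name)
    VALUES (1, 'Harbour'), (2, 'Museum'), (3, 'Park');
-- Location 1 is tagged; the tag on object_id 2 belongs to the other model.
INSERT INTO tagging_taggeditem (content_type_id, object_id)
    VALUES (4, 1), (5, 2);
""")

untagged = [row[0] for row in conn.execute("""
    SELECT name FROM myapp_location
    WHERE NOT EXISTS (
        SELECT 1 FROM tagging_taggeditem ti
        INNER JOIN django_content_type ct ON ti.content_type_id = ct.id
        WHERE ct.model = 'location'
          AND ti.object_id = myapp_location.id)
    ORDER BY id
""")]
```

Location 2 is reported as untagged even though a tagged item with object_id 2 exists, because that row belongs to a different content type. Without the join on django_content_type it would be wrongly treated as tagged.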
Try this:
[location for location in Location.objects.all() if location.tags.count() == 0]
Assuming your Location class uses the tagging.fields.TagField utility.
from tagging.fields import TagField
class Location(models.Model):
    tags = TagField()
You can just do this:
Location.objects.filter(tags='')