To use django-haystack or not?

This might be obvious to some, but I'm not sure what the right answer is. I have a simple donation application where Donor objects are created through a form. A feature to be added is a search for each Donor by last name and/or phone number.
Is this a good case for django-haystack, or should I just write my own filters? The problem I see with haystack is that a few donations are submitted every minute, so could indexing become a problem? There are currently around 130,000 records and growing. I have started to implement haystack but am now wondering whether it's necessary at all.

Don't use haystack -- it's for fast full-text search when the underlying relational database can't handle it easily. The use case for haystack is storing many large documents with big chunks of text that you want indexed word by word so you can search them easily.
Django already lets you index/search text records out of the box. For example, in the admin backend simply specify search_fields and you can search by name or telephone number. (It does case-insensitive contains searches by default, which finds partial matches; e.g., the name "John Doe" will come up if you search for just "doe" or "ohn".)
So if your models.py has:
class Donor(models.Model):
    name = models.CharField(max_length=50)
    phone = models.CharField(max_length=15)
and an admin.py with:
from django.contrib import admin
from mysite.myapp.models import Donor

class DonorAdmin(admin.ModelAdmin):
    model = Donor
    search_fields = ['name', 'phone']

admin.site.register(Donor, DonorAdmin)
it should work fine. If an improvement is needed, consider adding a full-text index in the underlying RDBMS. For example, with PostgreSQL 8.3+ you can create a text search index with a one-liner in the underlying database, which Django should automatically use: http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html
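Outside the admin you can get the same kind of matching with icontains lookups. A minimal sketch, assuming the Donor model above (the function name is illustrative):

from django.db.models import Q

from mysite.myapp.models import Donor

def search_donors(term):
    # Case-insensitive partial match on name or phone number,
    # the same matching the admin's search_fields does by default.
    return Donor.objects.filter(
        Q(name__icontains=term) | Q(phone__icontains=term)
    )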

Related

Django: How to do full-text search for Japanese (multibyte strings) in Postgresql

It is possible to create an index for searching using SearchVector. However, Japanese words are not separated by spaces, so the full-text search does not work properly.
How can I perform full-text search on Japanese (multi-byte) strings?
I thought about bringing in a search engine such as Elasticsearch, but that raised other problems.
If possible, I would like to do FTS with Postgres.
# models.py
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
from django.db import models

class Post(models.Model):
    title = models.CharField(max_length=300)
    search = SearchVectorField(null=True)

    class Meta:
        indexes = [GinIndex(fields=["search"])]

# update the search column
Post.objects.update(search=SearchVector('title'))
Look at the PGroonga Postgres extension for full-text search in all languages. It is used by the Zulip project with amazing results.
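As a rough sketch of what the PGroonga route can look like from Django -- assuming the extension is installed and a PGroonga index exists on title; the myapp_post table name is illustrative, and &@~ is PGroonga's full-text query operator:

from django.db import connection

def search_posts_ja(query):
    # Raw SQL so we can use PGroonga's operator directly;
    # &@~ performs a full-text search using PGroonga's query syntax.
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT id, title FROM myapp_post WHERE title &@~ %s",
            [query],
        )
        return cursor.fetchall()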

Filter multiple Django model fields with variable number of arguments

I'm implementing search functionality with the option of looking up a record by matching against multiple tables and multiple fields in those tables.
Say I want to find a Customer by first or last name, or by the ID of a placed Order, which is stored in a different model than Customer.
The easy scenario, which I have already implemented, is the user typing a single word into the search field. I then use Django's Q objects to query the Order model using direct field references or related_query_name references, like:
result = Order.objects.filter(
    Q(customer__first_name__icontains=user_input)
    | Q(customer__last_name__icontains=user_input)
    | Q(order_id__icontains=user_input)
).distinct()
Piece of cake, no problems at all.
But what if the user wants to narrow the search and types multiple words into the search field?
Example: the user typed Bruce and got a whole lot of records back.
Now he/she wants to be more specific and adds the customer's last name, so the search becomes Bruce Wayne. After splitting this into separate parts I have Bruce and Wayne. Obviously I don't want to search the Order model here, because order_id is a single word and is sufficient on its own to find the customer, so in this case I drop it from the query entirely.
Now I'm trying to match the customer by both first AND last name, and I also want to handle the case where the order of the words is arbitrary: both Bruce Wayne and Wayne Bruce should work, meaning I still have the customer's full name but the positions of the first and last name aren't fixed.
And this is the question I'm looking for an answer to: how do I build a query that searches multiple fields of a model when I don't know which search word belongs to which field?
I'm guessing the solution is trivial and there's surely an elegant way to create such a dynamic query, but I can't think of one.
You can dynamically OR a variable number of Q objects together to achieve your desired search. The approach below makes it trivial to add or remove fields you want to include in the search.
from functools import reduce
from operator import or_

from django.db.models import Q

fields = (
    'customer__first_name__icontains',
    'customer__last_name__icontains',
    'order_id__icontains',
)

parts = []
terms = ["Bruce", "Wayne"]  # produce this from your search input field
for term in terms:
    for field in fields:
        parts.append(Q(**{field: term}))

query = reduce(or_, parts)
result = Order.objects.filter(query).distinct()
The use of reduce combines the Q objects by ORing them together. Credit for that part goes to this answer.
The solution I came up with is rather complex, but it works exactly the way I wanted:
from functools import reduce
from operator import and_, or_

from django.db.models import Q

search_keys = user_input.split()
if len(search_keys) > 1:
    first_name_set = set()
    last_name_set = set()
    for key in search_keys:
        first_name_set.add(Q(customer__first_name__icontains=key))
        last_name_set.add(Q(customer__last_name__icontains=key))
    query = reduce(and_, [reduce(or_, first_name_set), reduce(or_, last_name_set)])
else:
    search_fields = [
        Q(customer__first_name__icontains=user_input),
        Q(customer__last_name__icontains=user_input),
        Q(order_id__icontains=user_input),
    ]
    query = reduce(or_, search_fields)
result = Order.objects.filter(query).distinct()
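A more general variant of the same idea (a sketch of my own, not from either answer above) is to OR the name fields within each term and then AND the per-term groups together; that handles both Bruce Wayne and Wayne Bruce without hard-coding which word is the first name:

from functools import reduce
from operator import and_, or_

from django.db.models import Q

name_fields = (
    'customer__first_name__icontains',
    'customer__last_name__icontains',
)

def build_query(terms):
    # Each term must match at least one field (OR within a term),
    # and every term must match somewhere (AND across terms).
    per_term = [
        reduce(or_, (Q(**{field: term}) for field in name_fields))
        for term in terms
    ]
    return reduce(and_, per_term)

result = Order.objects.filter(build_query(["Bruce", "Wayne"])).distinct()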

Haystack scores make no sense

I'm using Haystack with Elasticsearch for a project, but the scores I get make no sense (to me).
The models I'm trying to index and search look similar to:
class Car(models.Model):
    name = models.CharField(max_length=255)

class Color(models.Model):
    car = models.ForeignKey(Car)
    name = models.CharField(max_length=255)
And here's the search index -- even though I'm interested in cars, I want to search them by color because I want to display a picture of that specific color:
from haystack import indexes

class CarIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True)

    def get_model(self):
        return Color

    def prepare_text(self, obj):
        # Some cleaning
        return " ".join([obj.name, obj.car.name])
Now I add a car with three colors: a LaFerrari in Red, Black and White. Since the index is on Color, for search purposes there are 3 documents for this single car model.
I check Kibana and the indexed output looks normal.
Then I perform a plain search: LaFerrari
All three documents have the same info, with only the color name differing in the text field, yet the scores come back different. I've even tried removing the color from the text, and guess what -- the scores still didn't match.
After this fiasco, I tried the Python elasticsearch library directly (doing the indexing and searching by hand) and got sensible results: all three colors had the same score when I searched for LaFerrari.
Any idea what is going on?
I'm also thinking about moving from Haystack to plain Elasticsearch -- any recommendations?
If you want the search to distinguish the documents properly, you should add two more fields to the index:
color (the actual color, like "White", however you end up naming your models and attributes)
name (the brand/model name)
The catch-all document field will only get you so far. You would have to make Elasticsearch use a DisMax query that searches all the configured fields for the given search terms.
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/query-dsl-dis-max-query.html
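For illustration, a dis_max query through the Python elasticsearch client (which the question already uses for testing) could look roughly like the sketch below; the index name and the extra name/color fields are assumptions, not something Haystack sets up for you:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# dis_max scores each document by its best-matching sub-query
# instead of blending everything through the single catch-all text field.
body = {
    "query": {
        "dis_max": {
            "queries": [
                {"match": {"name": "LaFerrari"}},
                {"match": {"color": "Red"}},
            ]
        }
    }
}
results = es.search(index="haystack", body=body)  # index name is illustrative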
So far I've only used the SearchQuerySet with Elasticsearch (based on the catch-all field) myself (and custom code with Solr a lot). While SearchQuerySet fits in very nicely with the Django ORM, it will only get you so far, so you are probably right that you may need custom code for querying. I would still recommend Haystack for indexing, though (it might be slower, but it's very easy to set up and maintain).
Looking at your example, what you gain with separate fields is:
You search for LaFerrari and that is the exact value found in all three documents in the name (or brand_name) field, so the results will have the same scores.
Separate fields also enable you to use facets: https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-facets.html#search-facets
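A sketch of the index with the extra fields suggested above (the module path, field names and model_attr lookups are my own illustration, not from the original answer):

from haystack import indexes

from myapp.models import Color

class CarIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True)            # catch-all field
    color = indexes.CharField(model_attr='name')       # e.g. "Red"
    name = indexes.CharField(model_attr='car__name')   # the car/brand name

    def get_model(self):
        return Color

    def prepare_text(self, obj):
        return " ".join([obj.name, obj.car.name])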

Modern methods for filtering a Django annotation?

I'd like to filter an annotation using the Django ORM. A lot of the articles I've found here at SO are fairly dated, targeting Django back in the 1.2 to 1.4 days:
Filtering only on Annotations in Django - This question from 2010 suggests using an extra clause, which isn't recommended by the official Django docs
Django annotation with nested filter - Similar suggestions are provided in this question from 2011.
Django 1.8 adds conditional aggregation, which seems like what I might want, but I can't quite figure out the syntax that I'll eventually need. Here are my models and the scenario I'm trying to reach (I've simplified the models for brevity's sake):
class Project(models.Model):
    name = models.CharField()
    ... snip ...

class Milestone_meta(models.Model):
    name = models.CharField()
    is_cycle = models.BooleanField()

class Milestone(models.Model):
    project = models.ForeignKey('Project')
    meta = models.ForeignKey('Milestone_meta')
    entry_date = models.DateField()
I want to get each Project (with all its fields), along with the Max(entry_date) and Min(entry_date) across its associated Milestones, but only for those Milestone records whose associated Milestone_meta has the is_cycle flag set to True. In other words:
For every Project record, give me the maximum and minimum Milestone entry_dates, but only when the associated Milestone_meta has a given flag set to True.
At the moment, I'm getting a list of projects, then getting the Max and Min Milestones in a loop, resulting in N+1 database hits (which gets slow, as you'd expect):
from django.db.models import Max, Min

pqs = Project.objects.all()
for p in pqs:
    (theMin, theMax) = getMilestoneBounds(p)
    # Use values from p and theMin and theMax
    ...

def getMilestoneBounds(project):
    mqs = Milestone.objects.filter(project=project, meta__is_cycle=True)
    theData = mqs.aggregate(min_entry=Min('entry_date'), max_entry=Max('entry_date'))
    return (theData['min_entry'], theData['max_entry'])
How can I reduce this to one or two queries?
As far as I know, you cannot get all the required Project objects in one query.
However, if you don't need the full objects and can work with just their ids, one way would be:
Milestone.objects.filter(
    meta__is_cycle=True
).values('project').annotate(
    min_entry=Min('entry_date'),
    max_entry=Max('entry_date'),
)
This gives a list of dicts, one per distinct project, and you can then use each project id to look up the objects when needed.
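For completeness, the Django 1.8 conditional-aggregation syntax the question alludes to (a sketch, not part of either the question or the answer; it assumes the default milestone reverse relation name) would look roughly like this:

from django.db.models import Case, DateField, F, Max, Min, When

projects = Project.objects.annotate(
    # Only entry_dates whose Milestone_meta has is_cycle=True feed the aggregates;
    # other joined rows contribute NULL, which Min/Max ignore.
    min_entry=Min(Case(
        When(milestone__meta__is_cycle=True, then=F('milestone__entry_date')),
        output_field=DateField(),
    )),
    max_entry=Max(Case(
        When(milestone__meta__is_cycle=True, then=F('milestone__entry_date')),
        output_field=DateField(),
    )),
)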

Django DB, finding Categories whose Items are all in a subset

I have two models:
class Category(models.Model):
    pass

class Item(models.Model):
    cat = models.ForeignKey(Category)
I am trying to return all Categories for which all of that category's Items belong to a given subset of item ids. For example, all categories for which every item associated with that category has an id in the set [1, 3, 5].
How could this be done using Django's query syntax (as of 1.1 beta)? Ideally, all the work should be done in the database.
Category.objects.filter(item__id__in=[1, 3, 5])
Django creates the reverse relationship on the model without the foreign key. You can filter on it by using its related name (usually just the lowercased model name, though it can be overridden manually), two underscores, and the field name you want to query on.
Let's say you require all items to be in the following set:
allowable_items = set([1, 3, 4])
One brute-force solution would be to check the item_set for every category, like so:
categories_with_allowable_items = [
    category for category in Category.objects.all()
    if set([item.id for item in category.item_set.all()]) <= allowable_items
]
But we don't really have to check all categories: categories_with_allowable_items will always be a subset of the categories related to the items whose ids are in allowable_items, so those are the only categories we have to check (and this should be faster):
categories_with_allowable_items = set([
    item.cat for item in
    Item.objects.select_related('cat').filter(pk__in=allowable_items)
    if set([siblingitem.id for siblingitem in item.cat.item_set.all()]) <= allowable_items
])
If performance isn't really an issue, then the latter of these two (if not the former) should be fine. If these are very large tables, you might have to come up with a more sophisticated solution. Also, if you're using a particularly old version of Python, remember that you'll have to import the sets module.
I've played around with this a bit. If QuerySet.extra() accepted a "having" parameter I think it would be possible to do it in the ORM with a bit of raw SQL in the HAVING clause. But it doesn't, so I think you'd have to write the whole query in raw SQL if you want the database doing the work.
EDIT:
This is the query that gets you part way there:
from django.db.models import Count
Category.objects.annotate(num_items=Count('item')).filter(num_items=...)
The problem is that for the query to work, "..." needs to be a correlated subquery that looks up, for each category, the number of its items in allowed_items. If .extra had a "having" argument, you'd do it like this:
Category.objects.annotate(num_items=Count('item')).extra(
    having="num_items = (SELECT COUNT(*) FROM app_item WHERE app_item.id IN %s AND app_item.cat_id = app_category.id)",
    having_params=[allowed_item_ids],
)
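As a side note (my own sketch, not from the answers above), the work can be kept in the database without raw SQL by excluding any category that has an item outside the allowed set; the first filter also drops categories that have no items at all:

allowed = [1, 3, 5]
disallowed_items = Item.objects.exclude(id__in=allowed)

categories = (
    Category.objects
    .filter(item__id__in=allowed)         # has at least one allowed item
    .exclude(item__in=disallowed_items)   # and no items outside the set
    .distinct()
)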