Django Trigram: create gin index and search suggested words in Django

I have a model with title and description fields.
I want to create a GIN index for all the words in the title and description fields,
so I do it the following way using SQL:
STEP1: Create a table with all the words in title and description using simple config
CREATE TABLE words AS SELECT word FROM ts_stat('SELECT to_tsvector(''simple'',COALESCE("articles_article"."title", '''')) || to_tsvector(''simple'',COALESCE("articles_article"."description", '''')) FROM "articles_article"');
STEP2: Create GIN index
CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);
STEP3: SEARCH
SELECT word, similarity(word, 'sri') AS sml
FROM words
WHERE word % 'sri'
ORDER BY sml DESC, word;
Result:
word    sml
sri     1
srila   0.5
srimad  0.428571
How can I do this in Django? I also need to keep the GIN index up to date.

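The Django docs suggest that you install the relevant btree_gin extension and add the following to the model's Meta class:
from django.contrib.postgres.indexes import GinIndex
from django.db import models

class MyModel(models.Model):
    the_field = models.CharField(max_length=512)

    class Meta:
        indexes = [GinIndex(fields=['the_field'])]
For the trigram matching used in the question, the index would also need the pg_trgm extension and the gin_trgm_ops operator class (GinIndex accepts an opclasses argument for this in Django 2.2 and later).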
A relevant answer can be found here.
Regarding keeping the index in good shape over time, Heroku suggests:
Finally, indexes will become fragmented and unoptimized after some
time, especially if the rows in the table are often updated or
deleted. In those cases it may be required to perform a REINDEX
leaving you with a balanced and optimized index. However be cautious
about reindexing big indexes as write locks are obtained on the parent
table. One strategy to achieve the same result on a live site is to
build an index concurrently on the same table and columns but with a
different name, and then dropping the original index and renaming the
new one. This procedure, while much longer, won’t require any long
running locks on the live tables.

Related

Django queryset behind the scenes

Difference between creating a foreign key for consistency and for joins
I am comfortable using ForeignKey and the QuerySet API in Django.
I just want to understand a little more deeply how it works behind the scenes.
The Django manual says:
a database index is automatically created on the ForeignKey. You can
disable this by setting db_index to False. You may want to avoid the
overhead of an index if you are creating a foreign key for consistency
rather than joins, or if you will be creating an alternative index
like a partial or multiple column index.
The part about "creating a foreign key for consistency rather than joins"
is what confuses me.
I expected a query through a foreign key to use the JOIN keyword, like below.
SELECT
*
FROM
vehicles
INNER JOIN users ON vehicles.car_owner = users.user_id
For example:
from django.db import models

class Place(models.Model):
    name = models.CharField(max_length=50)
    address = models.CharField(max_length=50)

class Comment(models.Model):
    place = models.ForeignKey(Place, on_delete=models.CASCADE)
    content = models.CharField(max_length=50)
If you use a queryset like Comment.objects.filter(place=1), I expected the low-level SQL to use the JOIN keyword.
But when I checked by printing queryset.query in the console, it showed the following.
(I simplified the model above for the explanation. The output below shows all the attributes of my real model; you can ignore the extra ones.)
SELECT
"bfm_comment"."id", "bfm_comment"."content", "bfm_comment"."user_id", "bfm_comment"."place_id", "bfm_comment"."created_at"
FROM "bfm_comment" WHERE "bfm_comment"."place_id" = 1
creating a foreign key for consistency vs creating a foreign key for joins
Simply put, I thought that using any queryset means using the foreign key for joins, because you can easily get the parent table's data with c = Comment.objects.get(id=1) followed by c.place.name. I thought it joined the two tables behind the scenes. But the result of print(queryset.query) didn't show the JOIN keyword; it finds the rows with a WHERE clause instead.
The way I understood from an answer
Case 1:
Comment.objects.filter(place=1)
result
SELECT
"bfm_comment"."id", "bfm_comment"."content", "bfm_comment"."user_id", "bfm_comment"."place_id", "bfm_comment"."created_at"
FROM "bfm_comment"
WHERE "bfm_comment"."place_id" = 1
Case 2:
Comment.objects.filter(place__name="df")
result
SELECT "bfm_comment"."id", "bfm_comment"."content", "bfm_comment"."user_id", "bfm_comment"."place_id", "bfm_comment"."created_at"
FROM "bfm_comment" INNER JOIN "bfm_place" ON ("bfm_comment"."place_id" = "bfm_place"."id")
WHERE "bfm_place"."name" = df
Case 1 searches only the Comment table for rows whose place_id column is 1.
But Case 2 needs to know the Place table's 'name' attribute, so it has to use the JOIN keyword to check values in a column of the Place table. Right?
So is it right to think that I am creating a foreign key for joins if I use querysets like Case 2, and that it is then better to create an index on the foreign key?
For the above question, I think I can take the answer from the Django manual:
Consider adding indexes to fields that you frequently query using
filter(), exclude(), order_by(), etc. as indexes may help to speed up
lookups. Note that determining the best indexes is a complex
database-dependent topic that will depend on your particular
application. The overhead of maintaining an index may outweigh any
gains in query speed
In conclusion, it really depends on how my application uses it.

If you execute the following command the mystery will be revealed
./manage.py sqlmigrate myapp 0001
Take care to replace myapp with your app name (bfm I think) and 0001 with the actual migration where the Comment model is created.
The generated SQL will reveal that the table is actually created with a place_id integer column rather than a place column of type Place. That is because the RDBMS doesn't know anything about models; models exist only at the application level. It's the job of the Django ORM to fetch the data from the RDBMS and convert it into model instances. That's why you always get a place member on each of your Comment instances, and that place member in turn gives you access to the members of the related Place instance.
So what happens when you do?
Comment.objects.filter(place=1)
Django is smart enough to know that you are referring to a place_id because 1 is obviously not an instance of a Place. But if you used a Place instance the result would be the same. So there is no join here. The above query would definitely benefit from having an index on the place_id, but it wouldn't benefit from having a foreign key constraint!! Only the Comment table is queried.
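The claim that this lookup needs an index on place_id, and not a foreign key constraint, can be checked outside Django. The sketch below is only an illustration: it uses Python's sqlite3 module rather than the asker's database, a cut-down bfm_comment table, and a hypothetical index name (bfm_comment_place_idx). No foreign key is declared at all, and SQLite is asked for its query plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bfm_comment (
        id INTEGER PRIMARY KEY,
        content TEXT,
        place_id INTEGER          -- plain integer column, no REFERENCES clause
    );
    -- Hypothetical index name, chosen for this sketch only.
    CREATE INDEX bfm_comment_place_idx ON bfm_comment (place_id);
""")


def query_plan(sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


# Same shape as the SQL Django printed for Comment.objects.filter(place=1).
plan = query_plan("SELECT * FROM bfm_comment WHERE place_id = ?", (1,))
print(plan)
```

The printed plan should name the index on place_id even though the table has no constraint linking it to a place table; the exact wording of the plan differs between SQLite versions.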
If you want a join, try this:
Comment.objects.filter(place__name='my home')
Queries of this kind, with __ spanning a relation, usually result in a join, but sometimes they result in a subquery instead.

Querysets are lazy.
https://docs.djangoproject.com/en/1.10/topics/db/queries/#querysets-are-lazy
QuerySets are lazy – the act of creating a QuerySet doesn’t involve
any database activity. You can stack filters together all day long,
and Django won’t actually run the query until the QuerySet is
evaluated. Take a look at this example:

Django Postgres ArrayField aggregation and filtering

Following on from this question: Django Postgresql ArrayField aggregation
I have an ArrayField of Categories and I would like to retrieve all unique values it has - however the results should be filtered so that only values starting with the supplied string are returned.
What's the "most Django" way of doing this?
Given an Animal model that looks like this:
class Animal(models.Model):
    # ...
    categories = ArrayField(
        models.CharField(max_length=255, blank=True),
        default=list,
    )
    # ...
Then, as per the other question's answer, this works for finding all categories, unfiltered.
from django.db.models import F, Func

all_categories = (
    Animal.objects
    .annotate(categories_element=Func(F('categories'), function='unnest'))
    .values_list('categories_element', flat=True)
    .distinct()
)
However, when I attempt to filter the result it fails, not just with __startswith but with all types of filter:
all_categories.filter(categories_element__startswith='ga')
all_categories.filter(categories_element='dog')
Bottom of stacktrace is:
DataError: malformed array literal: "dog"
...
DETAIL: Array value must start with "{" or dimension information.
... and it appears that it's because Django tries to do a second UNNEST - this is the SQL it generates:
...) WHERE unnest("animal"."categories") = dog::text[]
If I write the query in PSQL then it appears to require a subquery as a result of the UNNEST:
SELECT categories_element
FROM (
    SELECT UNNEST(animal.categories) AS categories_element
    FROM animal
) ul
WHERE ul.categories_element LIKE 'Ga%';
Is there a way to get Django ORM to make a working query? Or should I just give up on the ORM and use raw SQL?

You probably have the wrong database design.
Tip: Arrays are not sets; searching for specific array elements can be
a sign of database misdesign. Consider using a separate table with a
row for each item that would be an array element. This will be easier
to search, and is likely to scale better for a large number of
elements.
http://www.postgresql.org/docs/9.1/static/arrays.html

using two xpathselectors on the same page

I have a spider that scrapes 3 items from the same page: brand, model and price.
Brands and models come from the same sel.xpath and are later extracted and told apart with .re in a loop. The price item, however, uses a different XPath. How can I use or combine two XPathSelectors in the spider?
Examples:
for brand and model:
titles = sel.xpath('//table[@border="0"]//td[@class="compact"]')
for prices:
prices = sel.xpath('//table[@border="0"]//td[@class="cl-price-cont"]//span[4]')
Each XPath was tested and exported individually. My problem is combining the two to construct the proper loop.
Any suggestions?
Thanks!

Provided you can differentiate all 3 kinds of items (brand, model, price) later, you can try using the XPath union operator (|) to bundle both XPath queries into one selector:
//table[@border="0"]//td[@class="compact"]
|
//table[@border="0"]//td[@class="cl-price-cont"]//span[4]
UPDATE:
In response to your comment: the above is meant to be a single XPath string. I'm not using Python, but I think it should look something like this:
sel.xpath('//table[@border="0"]//td[@class="compact"] | //table[@border="0"]//td[@class="cl-price-cont"]//span[4]')

I believe you are having trouble associating the price with the make/model because each XPath gives you a flat list of every match on the page, correct? Instead, what you want to do is build an XPath that gets you each row of the table. Then, in your loop, you can run further XPath queries relative to that row to pull out the make/model/price.
import re

rows = sel.xpath('//table[@border="0"]/tr')  # Get all the rows
for row in rows:
    # The leading "." keeps each query relative to the current row.
    make_model = row.xpath('.//td[@class="compact"]/text()').extract()
    # Set make and model here using your regex. Something like:
    (make, model) = re.match(r"^(.+?)\s(.+?)$", make_model[0]).groups()
    price = row.xpath('.//td[@class="cl-price-cont"]//span[4]/text()').extract()
    # Do something with the make/model/price.
This way, you know that in each iteration of the loop, the make/model/price you're getting all go together.
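The row-by-row idea can be tried without Scrapy. The sketch below is an assumption-laden stand-in: it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) instead of Scrapy selectors, and the markup, the brand names and the helper parse_rows are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for the scraped page.
PAGE = """
<table border="0">
  <tr>
    <td class="compact">Audi A4</td>
    <td class="cl-price-cont"><span>from</span><span>EUR</span><span>net</span><span>25000</span></td>
  </tr>
  <tr>
    <td class="compact">Ford Focus</td>
    <td class="cl-price-cont"><span>from</span><span>EUR</span><span>net</span><span>18000</span></td>
  </tr>
</table>
"""


def parse_rows(markup):
    """Return one (make, model, price) tuple per table row."""
    table = ET.fromstring(markup)
    items = []
    for row in table.findall("./tr"):
        # Both lookups start at the current row, so the values belong together.
        title = row.find("./td[@class='compact']").text
        price = row.find("./td[@class='cl-price-cont']/span[4]").text
        # First word is taken as the make, the rest as the model.
        make, model = re.match(r"^(\S+)\s+(.+)$", title).groups()
        items.append((make, model, price))
    return items


print(parse_rows(PAGE))
```

Because every query is evaluated against one row element, a price can never be paired with a title from a different row, which is the problem the two page-wide selectors had.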

Filtering on the concatenation of two model fields in django

With the following Django model:
class Item(models.Model):
    name = models.CharField(max_length=256)
    description = models.TextField()
I need to formulate a filter method that takes a list of n words (word_list) and returns the queryset of Items where each word in word_list can be found, either in the name or the description.
To do this with a single field is straightforward enough. Using the reduce technique described here (this could also be done with a for loop), this looks like:
import operator
from functools import reduce
from django.db.models import Q

q = reduce(operator.and_, (Q(description__contains=word) for word in word_list))
Item.objects.filter(q)
I want to do the same thing but take into account that each word can appear either in the name or the description. I basically want to query the concatenation of the two fields, for each word. Can this be done?
I have read that there is a concatenation operator in Postgresql, || but I am not sure if this can be utilized somehow in django to achieve this end.
As a last resort, I can create a third column that contains the combination of the two fields and maintain it via post_save signal handlers and/or save method overrides, but I'm wondering whether I can do this on the fly without maintaining this type of "search index" type of column.

The most straightforward way would be to use Q to do an OR:
lookups = [Q(name__contains=word) | Q(description__contains=word)
           for word in word_list]
Item.objects.filter(*lookups)  # the same as and'ing them together
I can't speak to the performance of this solution as compared to your other two options (raw SQL concatenation or denormalization), but it's definitely simpler.
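To make the shape of that query concrete, here is a rough sketch of the kind of SQL such a filter corresponds to: one "name OR description" group per word, with the groups AND-ed together. It runs against an in-memory SQLite database using the standard sqlite3 module rather than through Django; the table name item, the sample rows and the helper items_matching_all are assumptions made for illustration. Note also that SQLite's LIKE is case-insensitive for ASCII, which may differ from how contains behaves on other backends.

```python
import sqlite3


def items_matching_all(conn, word_list):
    """Return names of items where every word occurs in name or description."""
    if word_list:
        # One "(name LIKE ? OR description LIKE ?)" group per word, AND-ed together.
        clause = " AND ".join(
            "(name LIKE ? OR description LIKE ?)" for _ in word_list
        )
    else:
        clause = "1"  # no words given: every row matches
    params = []
    for word in word_list:
        # Wildcard characters inside a word (% and _) are not escaped in this sketch.
        pattern = "%" + word + "%"
        params.extend([pattern, pattern])
    sql = "SELECT name FROM item WHERE " + clause + " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO item (name, description) VALUES (?, ?)",
    [
        ("red bike", "fast and light"),
        ("blue bike", "red frame"),
        ("green car", "slow"),
    ],
)

print(items_matching_all(conn, ["red", "bike"]))
```

Here "blue bike" matches because "red" is found in its description and "bike" in its name, which is the cross-field behaviour the question asks for.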

Django DB, finding Categories whose Items are all in a subset

I have two models:
class Category(models.Model):
    pass

class Item(models.Model):
    cat = models.ForeignKey(Category)
I am trying to return all Categories for which all of that category's items belong to a given subset of item ids. For example, all categories for which all of the items associated with that category have ids in the set [1,3,5].
How could this be done using Django's query syntax (as of 1.1 beta)? Ideally, all the work should be done in the database.

Category.objects.filter(item__id__in=[1, 3, 5])
Django creates the reverse relationship on the model without the foreign key. You can filter on it by using its related name (usually just the lowercased model name, but it can be overridden manually), two underscores, and the field name you want to query on.

Let's say you require all items to be in the following set:
allowable_items = set([1, 3, 4])
One brute-force solution would be to check the item_set of every category, like so:
categories_with_allowable_items = [
    category for category in Category.objects.all()
    if set(item.id for item in category.item_set.all()) <= allowable_items
]
But we don't really have to check all categories, as categories_with_allowable_items is always going to be a subset of the categories related to the items with ids in allowable_items, so that's all we have to check (and this should be faster):
categories_with_allowable_items = set(
    item.cat for item in
    Item.objects.select_related('cat').filter(pk__in=allowable_items)
    if set(sibling.id for sibling in item.cat.item_set.all()) <= allowable_items
)
If performance isn't really an issue, then the latter of these two (if not the former) should be fine. If these are very large tables, you might have to come up with a more sophisticated solution. Also, if you're using a particularly old version of Python, remember that you'll have to import the sets module.

I've played around with this a bit. If QuerySet.extra() accepted a "having" parameter I think it would be possible to do it in the ORM with a bit of raw SQL in the HAVING clause. But it doesn't, so I think you'd have to write the whole query in raw SQL if you want the database doing the work.
EDIT:
This is the query that gets you part way there:
from django.db.models import Count
Category.objects.annotate(num_items=Count('item')).filter(num_items=...)
The problem is that for the query to work, "..." needs to be a correlated subquery that looks up, for each category, the number of its items in allowed_item_ids. If .extra() had a "having" argument (it does not), you'd do it like this:
Category.objects.annotate(num_items=Count('item')).extra(having="num_items=(SELECT COUNT(*) FROM app_item WHERE app_item.id IN %s AND app_item.cat_id = app_category.id)", having_params=[allowed_item_ids])
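For reference, the raw-SQL route mentioned above could look roughly like the sketch below. It is an illustration under stated assumptions, not the answer's own code: it runs on an in-memory SQLite database via the standard sqlite3 module, reuses the app_category and app_item table names from the snippet above, and uses a HAVING clause that compares each category's total item count with its count of allowed items instead of a correlated subquery. The helper categories_within and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_category (id INTEGER PRIMARY KEY);
    CREATE TABLE app_item (
        id INTEGER PRIMARY KEY,
        cat_id INTEGER REFERENCES app_category (id)
    );
    INSERT INTO app_category (id) VALUES (1), (2), (3);
    INSERT INTO app_item (id, cat_id) VALUES (1, 1), (3, 1), (5, 2), (6, 2), (7, 3);
""")


def categories_within(conn, allowed_ids):
    """Return ids of categories whose items ALL have ids in allowed_ids."""
    allowed_ids = list(allowed_ids)
    placeholders = ",".join("?" for _ in allowed_ids)
    # A category qualifies when its total item count equals the number of
    # its items that fall inside the allowed set.
    sql = f"""
        SELECT c.id
        FROM app_category AS c
        JOIN app_item AS i ON i.cat_id = c.id
        GROUP BY c.id
        HAVING COUNT(*) = SUM(CASE WHEN i.id IN ({placeholders}) THEN 1 ELSE 0 END)
        ORDER BY c.id
    """
    return [row[0] for row in conn.execute(sql, allowed_ids)]


print(categories_within(conn, [1, 3, 5]))
```

With the sample rows, category 1 (items 1 and 3) qualifies for the set [1, 3, 5], category 2 is rejected because of item 6, and category 3 because of item 7. Because of the inner join, categories with no items at all are left out; whether they should count is a design decision the question does not settle.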
