Inexact full-text search in PostgreSQL and Django


I'm new to PostgreSQL, and I'm not sure how to go about doing an inexact full-text search. Not that it matters too much, but I'm using Django. In other words, I'm looking for something like the following:
q = 'hello world'
queryset = Entry.objects.extra(
    where=['body_tsv @@ plainto_tsquery(%s)'],
    params=[q])
for entry in queryset:
    print(entry.title)
where the list of entries should contain either exactly 'hello world' or something similar. The listings should then be ordered according to how far away their value is from the specified string. For instance, I would like the query to include entries containing "Hello World", "hEllo world", "helloworld", "hell world", etc., with some sort of ranking indicating how far away each item is from the perfect, unchanged query string.
How would you go about doing this?

Your best bet is to use Django raw querysets; I use them with MySQL to perform full-text matching. If the data is all in the database and Postgres provides the matching capability, then it makes sense to use it. Plus, Postgres offers some really useful things, such as stemming, with full-text queries.
Basically, a raw queryset lets you write the actual query you want yet still returns models (as long as you are querying a model table, obviously).
The advantage this gives you is that you can test the exact query you will be using in Postgres first; the documentation covers full-text queries pretty well.
The main gotcha with raw querysets at the moment is that they don't support count. So if you will be returning lots of data and have memory constraints on your application, you might need to do something clever.
"Inexact" matching, however, isn't really part of the full-text searching capabilities. Instead you want the Postgres fuzzystrmatch contrib module. Its use is described here with indexes.
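To illustrate the kind of ranking the question asks for: fuzzystrmatch provides a levenshtein() function, and the idea behind it can be sketched in plain Python. This is only a sketch of the technique, not what Postgres runs internally; the rank helper and the case-folding step are assumptions added here so that "Hello World" and "hEllo world" count as exact matches.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def rank(query, candidates):
    """Hypothetical helper: sort candidates by case-insensitive edit distance."""
    q = query.casefold()
    scored = [(levenshtein(q, c.casefold()), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0])


entries = ["hell world", "Hello World", "goodbye", "helloworld", "hEllo world"]
for distance, title in rank("hello world", entries):
    print(distance, title)
```

In the database itself the same ordering would come from sorting on the distance computed by the contrib function, so the whole table does not have to be pulled into Python.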

The best would be to use a search engine for this purpose. Django-haystack supports the integration of three different search engines.

In 2022, Django supports full text search with postgres. Full documentation here: https://docs.djangoproject.com/en/4.0/ref/contrib/postgres/search/

Related

Querying Django Postgres for Fuzzy FTS + Typeahead

I'm using PostgreSQL 12.3 with Django 3.0.8. I'm trying to implement a typeahead functionality. I tried using Trigram similarity, but that didn't work too well since, if I were to have a text saying "chair," I would have to type "cha" before I get results.
I'm using corejavascript/twitter's typeahead JS library. Their first example (The Basics) allows you to type in a single letter and get results instantly, some of which are not even the prefix of the word.
How would I do that? You can view my current functionality here: https://github.com/Donate-Anything/Donate-Anything/blob/76348e9362d386d3d6375b9a75d47d5765960992/donate_anything/item/views.py#L21-L30
I thought that I should use to_tsvector with a SearchVectorField, but when I implemented that with a query like this:
queryset = (
    Item.objects.defer("is_appropriate")
    .filter(name_search=str(query))
)
I don't see any results until I type in the full word. What would my implementation look like then? (Article for implementing the triggers for SearchVectorField)
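One possible explanation for the "cha" behaviour with trigram similarity can be sketched in plain Python. The sketch assumes pg_trgm's documented conventions (lowercasing, two spaces of padding before a word and one after, similarity as shared trigrams over all distinct trigrams, default threshold 0.3) and only handles single words; it is an approximation for illustration, not the extension's actual code.

```python
def trigrams(word):
    """Set of trigrams for a single word, padded the way pg_trgm documents."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


THRESHOLD = 0.3  # assumed: pg_trgm's default similarity threshold

for typed in ("c", "ch", "cha"):
    score = similarity(typed, "chair")
    print(typed, round(score, 3), score > THRESHOLD)
```

Under these assumptions "c" scores 1/7 and "ch" scores 2/7, both below 0.3, while "cha" scores 3/7 and is the first prefix to clear the threshold, which matches the behaviour described above.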

What is the difference between annotations and regular lookups using Django's JSONField?

You can query Django's JSONField either by direct lookup or by using annotations. Now I realize that if you annotate a field, you can do all sorts of complex queries, but for the very basic query, which one is actually the preferred method?
Example: Let's say I have a model like so
class Document(models.Model):
    data = JSONField()
And then I store an object using the following command:
>>> Document.objects.create(data={'name': 'Foo', 'age': 24})
Now, the query I want is the most basic: Find all documents where data__name is 'Foo'. I can do this 2 ways, one using annotation, and one without, like so:
>>> from django.db.models.expressions import RawSQL
>>> Document.objects.filter(data__name='Foo')
>>> Document.objects.annotate(name = RawSQL("(data->>'name')::text", [])).filter(name='Foo')
So what exactly is the difference? And if I can make basic queries, why do I need to annotate? Provided of course I am not going to make complex queries.

There is no reason whatsoever to use raw SQL for queries where you can use ORM syntax. For someone who is conversant in SQL but less experienced with Django's ORM, RawSQL might provide an easier path to a certain result than the ORM, which has its own learning curve.
There might be more complex queries where the ORM runs into problems or where it might not give you the exact SQL query that you need. It is in these cases that RawSQL comes in handy – although the ORM is getting more feature-complete with every iteration, with
Cast (since 1.10),
Window functions (since 2.0),
a constantly growing array of wrappers for database functions
the ability to define custom wrappers for database functions with Func expressions (since 1.8) etc.

They are interchangeable, so it's a matter of taste. I think Document.objects.filter(data__name='Foo') is better because:
It's easier to read.
In the future, MariaDB or MySQL may support JSON fields, and your code will then be able to run on both PostgreSQL and MariaDB.
Don't use RawSQL as a general rule. You can create security holes in your app.
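The security point can be illustrated without Django. The sketch below uses Python's sqlite3 module as a stand-in database, and the table and function names are made up for the example. The same distinction applies to RawSQL: values belong in its params list, not interpolated into the SQL string.

```python
import sqlite3

# Stand-in database with a hypothetical "document" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (name TEXT)")
conn.executemany("INSERT INTO document VALUES (?)", [("Foo",), ("Bar",)])


def find_unsafe(name):
    # Vulnerable: the value becomes part of the SQL text itself.
    sql = "SELECT name FROM document WHERE name = '%s'" % name
    return [row[0] for row in conn.execute(sql)]


def find_safe(name):
    # Safe: the value is passed separately as a bound parameter.
    sql = "SELECT name FROM document WHERE name = ?"
    return [row[0] for row in conn.execute(sql, (name,))]


malicious = "x' OR '1'='1"
print(find_unsafe(malicious))  # every row leaks
print(find_safe(malicious))    # no rows match
```

With the ORM lookup Document.objects.filter(data__name='Foo') this parameter handling is done for you, which is one more reason to prefer it for basic queries.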

Modifying field with regex in Mongo and adding it to a new field

I'm a mongo noob and have what I hope is a pretty easy question. I received a 100gb .bson file yesterday and need to quickly retrieve some documents associated with urls. Unfortunately, the people that managed the database decided to change the schema for storing urls halfway through its life. This means that the url field must be queried via regex and cannot be indexed.
What I am hoping to do is this: regex out some common string between the two versions of urls and store it in a new field called url_id. This field could then be indexed to make for quicker queries. Looking through some past SO posts i cobbled together some pseudo-code that might do the trick:
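// Pseudocode; I don't know JavaScript that well.
db.eval(function() {
    db.foo.find({}, {url: 1}).forEach(function(e) {
        // Drop the scheme, "www." and the query string; keep "domain.com/path".
        var matches = e.url.match(/(domain\.com\/[^?]*)/);
        if (matches) {
            // $set adds url_id without overwriting the rest of the document.
            db.foo.update({_id: e._id}, {$set: {url_id: matches[1]}});
        }
    });
});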
Then I could run:
db.foo.ensureIndex({url_id:1})
Which would create a new index that would be quicker to query by so long as I properly modified the urls before querying for them.
However, I'm scared at the prospect of running a for loop across 100gb of records. Is there a better way to do this that I'm not thinking of?

Figured out a workaround...
By simply scripting the modification of the input url to create various versions of itself, I was able to run multiple queries on the indexed database and concatenate the results. Hacky but it worked!
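A rough sketch of what such a script could look like, written in Python. The answer does not say how the two URL schemas differed, so the variations generated here (http or https, with or without "www.", with or without the query string) are assumptions based on the comment in the question, and url_variants is a hypothetical helper name.

```python
from itertools import product
from urllib.parse import urlsplit, urlunsplit


def url_variants(url):
    """Return the plausible stored forms of url, without duplicates."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    bare_host = host[4:] if host.startswith("www.") else host
    queries = (parts.query, "")  # with and without the query string
    variants = [
        urlunsplit((scheme, prefix + bare_host, parts.path, query, ""))
        for scheme, prefix, query in product(("http", "https"), ("", "www."), queries)
    ]
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(variants))


for variant in url_variants("http://www.domain.com/a?b=1"):
    print(variant)
```

Each variant can then be looked up as an exact match against the indexed url field and the results concatenated, which avoids both the regex scan and a rewrite of every document.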

Django Haystack similarity search

I'm a Django newbie building a primitive website. I installed Haystack with Whoosh as its search engine because it was the simplest thing to do. It works fine, but there is a problem and I don't know how to Google it. I have some categories on my site and I have indexed their names for searching. So, when a user enters "Computing", it finds the Computing category and links to it. But if a user enters "Comp" into the search field, it doesn't find "Computing" at all. Is this something that can be configured, and how?
EDIT:
What else have I tried? Installing haystack 2.0, following this tutorial, installing solr instead of whoosh, trying Ngram fields, rebuilding indexes 10 times, rewriting search_indexes.py. Everything. Doesn't work. If I type in Comp, it doesn't find Computing. Is there anything else I could do? I have noticed that in the tutorial above, everything works like a charm instantly.

When you do the usual:
SearchQuerySet().filter(title='Computing')
in Haystack 1.x, it filters on everything exactly matching 'Computing'.
You can change that behaviour by using Haystack's Field Lookups, for example, using 'contains' will filter on anything containing the given string (Computing, Utingcomp, Comp):
SearchQuerySet().filter(title__contains='Comp')
In Haystack 2.x, the default filter is 'contains', so it should behave as you would expect it to "out-of-the-box"

Check out the documentation on autocomplete. You need to set up your indexes to support n-grams, but this should be exactly what you need.
from haystack.query import SearchQuerySet
SearchQuerySet().autocomplete(content_auto='old')
# Results match things like 'goldfish', 'cuckold' & 'older'.
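Why an n-gram index matches 'old' inside 'goldfish' can be sketched in plain Python. This is a toy model of the idea rather than Haystack's or any backend's implementation; the function names and the n-gram sizes (3 to 15) are assumptions chosen for illustration.

```python
def ngrams(word, min_size=3, max_size=15):
    """All substrings of word whose length is between min_size and max_size."""
    word = word.lower()
    return {
        word[i:i + n]
        for n in range(min_size, max_size + 1)
        for i in range(len(word) - n + 1)
    }


def build_index(words):
    """Map every n-gram to the set of words that contain it."""
    index = {}
    for word in words:
        for gram in ngrams(word):
            index.setdefault(gram, set()).add(word)
    return index


def autocomplete(index, typed):
    """Look the typed fragment up as a single n-gram."""
    return sorted(index.get(typed.lower(), set()))


index = build_index(["goldfish", "cuckold", "older", "gild", "Computing"])
print(autocomplete(index, "old"))
print(autocomplete(index, "Comp"))
```

Because every fragment is stored at index time, a partial word such as "Comp" becomes an exact lookup, at the cost of a larger index.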

So, if I'm understanding, what you're looking for is the equivalent of 'LIKE' in SQL.
The problem is search engines that back Haystack aren't like an RDBMS.
The low level implementation of this filter will involve using wildcard characters but most of the Haystack backends don't support a leading wildcard, something required for an icontains/endswith filter. However, since most backends support trailing wildcards, Haystack 2.x includes a startswith filter. The only case this doesn't handle is searching for the end of a word, which doesn't look to be possible.
So, if you have indexed:
"Look at our great discounts in Computer section"
Then the following Haystack query does match:
SearchQuerySet().filter(title__startswith='comp')
# match!
Notice the difference between Django vs. Haystack startswith filters. Django startswith will match at the beginning of the complete sentence (i.e. a CharField), but the Haystack one will match at the beginning of a token (i.e. each word in a complete sentence).
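The difference can be mimicked in plain Python. This is only a model of the two behaviours, with made-up function names, whitespace tokenisation and case-insensitive comparison assumed for simplicity; real backends tokenise and normalise text in their own ways.

```python
def whole_string_startswith(text, prefix):
    """Database-style prefix match: the prefix must begin the whole value."""
    return text.lower().startswith(prefix.lower())


def token_startswith(text, prefix):
    """Search-engine-style prefix match: the prefix may begin any token."""
    prefix = prefix.lower()
    return any(token.startswith(prefix) for token in text.lower().split())


title = "Look at our great discounts in Computer section"
print(whole_string_startswith(title, "comp"))  # prefix is not at the start of the sentence
print(token_startswith(title, "comp"))         # prefix starts the token "computer"
```

Neither function matches "puter", which is the end-of-word case mentioned above.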
Hope it helps!

How do you access/configure summaries/snippets in Django Haystack

I'm working on getting django-haystack set up on my site, and am trying to have snippets in my search results roughly like so:
Title of result one about Wikis
...this special thing about wiki values is that...I always use a wiki when I walk...snippet value three talks about wikis too...and here's another snippet value about wikis.
I know there's a template tag that uses Haystack code to do the highlighting, but the snippets it generates are pretty limited:
they always start with the query word
there's only one snippet value
they don't support asterisk queries
and other stuff?
Is there a way to use the Solr backend to generate proper snippets as shown above?

Bottom line is that the Solr highlighting can't really be used by Haystack in a flexible way. I spoke to the main developer for Haystack on IRC, and he said basically, if I want to have the kind of highlighting I'm looking for, the only way to get it is to extend the Solr backend that Haystack uses.
I dabbled in that for about half a day, but couldn't get Haystack to recognize my custom back end. Haystack has some magic backend loading code that just wasn't working with me.
Consequently, I've switched over to sunburnt, which provides a lighter-weight and more extensible wrapper around Solr. I'm hoping it will fare better.

from haystack.utils import Highlighter
my_text = 'This is a sample block that would be more meaningful in real life.'
my_query = 'block meaningful'
highlight = Highlighter(my_query)
highlight.highlight(my_text)
http://docs.haystacksearch.org/dev/highlighting.html
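For comparison, the multi-fragment snippets the question describes can be approximated in plain Python, independent of Haystack and Solr. This is a minimal sketch: the snippets function, its width and max_fragments parameters, and the "..." separator are all assumptions made for the example, and it does no stemming or wildcard handling.

```python
import re


def snippets(text, query, width=30, max_fragments=3):
    """Join up to max_fragments windows of text around query-term hits."""
    terms = [re.escape(term) for term in query.split()]
    if not terms:
        return ""
    pattern = re.compile("|".join(terms), re.IGNORECASE)
    fragments = []
    last_end = 0
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        if start < last_end:
            # This hit falls inside the previous window; skip it.
            continue
        end = min(match.end() + width, len(text))
        fragments.append(text[start:end].strip())
        last_end = end
        if len(fragments) == max_fragments:
            break
    return "...".join(fragments)


sample = "aaa wiki bbb " + "x" * 100 + " ccc wiki ddd"
print(snippets(sample, "wiki", width=4))
```

Unlike the single highlighted excerpt, each fragment here is centred on a hit, so a snippet does not have to start with the query word.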
