Partial matching search in Wagtail with Postgres - django

I've got a wagtail site powered by Postgres and would like to implement a fuzzy search on all documents. However, according to wagtail docs "SearchField(partial_match=True) is not handled." Does anyone know of a way I can implement my own partial matching search?
I'm leaving this question intentionally open-ended because I'm open to pretty much any solution that works well and is fairly scalable.

We’re currently rebuilding the Wagtail search API in order to make autocomplete usable roughly the same way across backends.
For now, you can directly use the IndexEntry model that stores search data. Unfortunately, django.contrib.postgres.search does not provide a way to run an autocomplete query, so we have to do it ourselves for now. Here is how to do that:
from django.contrib.postgres.search import SearchQuery
from wagtail.contrib.postgres_search.models import IndexEntry
class SearchAutocomplete(SearchQuery):
    def as_sql(self, compiler, connection):
        return "to_tsquery(''%s':*')", [self.value]
query = SearchAutocomplete('postg')
print(IndexEntry.objects.filter(body_search=query).rank(query))
# All results containing words starting with “postg”
# should be displayed, sorted by relevance.
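The snippet above handles a single term. As a rough sketch of how a multi-word input could be turned into a prefix-matching tsquery string, the helper below quotes each word and appends `:*`. The function name `build_prefix_tsquery` is hypothetical and not part of Wagtail or Django, and the string it returns would still have to be passed to to_tsquery as a query parameter.

```python
def build_prefix_tsquery(text: str) -> str:
    """Turn free text into a tsquery string that prefix-matches every word.

    Hypothetical helper for illustration; not part of Wagtail or Django.
    """
    terms = []
    for word in text.split():
        # Double backslashes and single quotes so each word stays one quoted lexeme.
        escaped = word.replace("\\", "\\\\").replace("'", "''")
        terms.append("'{}':*".format(escaped))
    # "&" requires every term to match.
    return " & ".join(terms)


example = build_prefix_tsquery("postg sea")
```

An empty input produces an empty string, so a caller would probably want to skip the search entirely in that case rather than send an empty query to Postgres.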

It doesn't seem to be documented yet, but the gist of autocomplete filtering with Postgres, using a request object, is something like
from django.conf import settings
from wagtail.api.v2.utils import BadRequestError
from wagtail.search.backends import get_search_backend
from wagtail.search.backends.base import FilterFieldError, OrderByFieldError
def filter_queryset(queryset, request):
    search_query = request.GET.get("search", "").strip()
    search_enabled = getattr(settings, 'WAGTAILAPI_SEARCH_ENABLED', True)
    if 'search' in request.GET and search_query:
        if not search_enabled:
            raise BadRequestError("search is disabled")
        search_operator = request.GET.get('search_operator', None)
        order_by_relevance = 'order' not in request.GET
        sb = get_search_backend()
        try:
            queryset = sb.autocomplete(search_query, queryset, operator=search_operator, order_by_relevance=order_by_relevance)
        except FilterFieldError as e:
            raise BadRequestError("cannot filter by '{}' while searching (field is not indexed)".format(e.field_name))
        except OrderByFieldError as e:
            raise BadRequestError("cannot order by '{}' while searching (field is not indexed)".format(e.field_name))
    # Return the (possibly filtered) queryset to the caller
    return queryset
The line to note is the call to sb.autocomplete.
If you want to use custom fields with autocomplete, you'll also need to add them into search_fields as an AutocompleteField in addition to a SearchField -- for example
search_fields = Page.search_fields + [
    index.SearchField("field_to_index", partial_match=True),
    index.AutocompleteField("field_to_index", partial_match=True),
    ...
This solution works for Wagtail 2.3. If you are using an older version, it is unlikely to work, and if you are using a future version, hopefully the details will be incorporated into the official documentation, which currently states that autocomplete with Postgres is NOT possible. Thankfully, that has turned out not to be true, due to the work of Bertrand Bordage in the time since he wrote the other answer.

Related

Problem with Django Tests and Trigram Similarity

I have a Django application that executes a full-text search on a database. The view that executes this query is my search_view (I'm omitting some parts for the sake of simplicity). It just retrieves the results of the search on my Post model and sends them to the template:
from django.contrib.postgres.search import SearchQuery, SearchRank, TrigramSimilarity
from django.db.models import F
from django.shortcuts import render
def search_view(request):
    query = request.GET.get('q')
    search_query = SearchQuery(query, config='english')
    # Combine the full-text rank with the trigram similarity of the title
    qs = Post.objects.annotate(
        rank=SearchRank(F('vector_column'), search_query) + TrigramSimilarity('post_title', query)
    ).filter(rank__gte=0.15).order_by('-rank')
    context = {
        'results': qs,
    }
    return render(request, 'core/search.html', context)
The application is working just fine. The problem is with a test I created. Here is my tests.py:
class SearchViewTests(TestCase):
    def test_search_without_results(self):
        """
        If the user's query did not retrieve anything
        show him a message informing that
        """
        response = self.client.get(reverse('core:search') + '?q=eksjeispowjskdjies')
        self.assertEqual(response.status_code, 200)
        # assertContains expects the response object, not response.content
        self.assertContains(response, "We didn't find anything on our database. We're sorry")
This test raises a ProgrammingError exception:
django.db.utils.ProgrammingError: function similarity(character varying, unknown) does not exist
LINE 1: ...plainto_tsquery('english'::regconfig, 'eksjeispowjskdjies')) + SIMILARITY...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
I understand this exception well, because I have run into it before. The SIMILARITY function in Postgres accepts two arguments, and both need to be of type TEXT. The exception is raised because the second argument (my query term) is of type UNKNOWN, so the function does not work and Django raises the exception. What I don't understand is why, because the actual search works! Even in the shell it works perfectly:
In [1]: from django.test import Client
In [2]: c = Client()
In [3]: response = c.get(reverse('core:search') + '?page=1&q=eksjeispowjskdjies')
In [4]: response
Out[4]: <HttpResponse status_code=200, "text/html; charset=utf-8">
Any ideas about why the test doesn't work, while the actual execution of the app and the console test both work?
I had the same problem and this is how I solved it in my case:
First of all, the problem in my setup was that the test database was created without running my migrations (this happens, for example, when pytest-django is invoked with --nomigrations). The tables are then simply created based on your models.
This means that migrations that create an extension in your database, like pg_trgm, do not run when the test database is created.
One way to overcome this is to use a fixture in your conftest.py file which will create said extensions before any tests run.
So, in your conftest.py file add the following:
import pytest
from django.db import connection
# the following fixture is used to add the pg_trgm extension to the test database
@pytest.fixture(scope="session", autouse=True)
def django_db_setup(django_db_setup, django_db_blocker):
    """Test session DB setup."""
    with django_db_blocker.unblock():
        with connection.cursor() as cursor:
            cursor.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
You can of course replace pg_trgm with any other extension you require.
PS: You must make sure the extension you are trying to use works for the test database you have chosen. In order to change the database used by Django you can change the value of
DATABASES = {'default': env.db('your_database_connection_uri')}
in your application's settings.py.
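As background on what the missing function computes: SIMILARITY from pg_trgm compares the sets of three-character sequences in two strings. The sketch below is a pure-Python approximation for illustration only (the real work is done inside Postgres, and details such as non-ASCII handling may differ); the names `trigrams` and `similarity` here are just local helpers.

```python
def trigrams(text: str) -> set:
    """Approximate the trigram set pg_trgm builds for a string."""
    result = set()
    # Lowercase the input and treat non-alphanumeric characters as word breaks.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    for word in cleaned.split():
        # Each word is padded with two spaces in front and one behind.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


score = similarity("word", "two words")  # 4 shared trigrams out of 11
```

A nonsense query such as the one in the test shares almost no trigrams with real titles, which is why it is a reasonable input for a "no results" test once the extension exists in the test database.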

Unable to search part of a term with Django Haystack and Elasticsearch

I have a project based on
Django==1.9.2
django-haystack==2.4.1
elasticsearch==2.2.0
A very simple search view:
def search_view(request):
    query = request.GET.get('q', '')
    sqs = SearchQuerySet().filter(content=query)
    params = {
        'results': sqs,
        'query': query,
    }
    return render_to_response('results.html', params,
                              context_instance=RequestContext(request))
My search index is as simple as:
class CategoryIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name = indexes.CharField(model_attr='name')
    def get_model(self):
        return Category
    def index_queryset(self, using=None):
        return self.get_model().objects.filter(published=True)
The category_text.txt file is just:
{{ object.name }}
In my database I have a few items:
Acqua
Acquario
Aceto
Accento
When I search with my view, I have strange behaviours.
Searching with query "ac" I receive no results! I was expecting to get all my items. I have tried to change the query using .filter(content__contains=query) (I know it is the default!) but nothing changed.
Searching with query "acqua" I receive 1 result (correct) with the result object, but when I try to print it, the result.object field is None (the other fields contain the correct information).
What am I doing wrong?
Thank you.
UPDATE
I have found a solution to my problem number 2. The latest Haystack version on PyPI is not compatible with Django 1.9.x.
I have just added -e git+https://github.com/django-haystack/django-haystack.git#egg=django-haystack to my requirements.txt file and the issue is fixed. More info about that on GitHub: https://github.com/django-haystack/django-haystack/issues/1291
The other issue is still open and I cannot find any solution to it.
It sounds like you may be running into a minimum-number-of-characters issue for #1. Take a look at the Haystack documentation on autocomplete, which shows an approach using EdgeNgramField instead of the typical CharField.
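To illustrate why an edge n-gram field lets a short prefix like "ac" match, here is a small pure-Python sketch of the idea. It does not use Haystack or Elasticsearch; the helper names and the gram sizes (2 to 15) are assumptions chosen for the example, and the real analyzer settings depend on your backend configuration.

```python
def edge_ngrams(word, min_gram=2, max_gram=15):
    """Return the leading substrings of word, from min_gram to max_gram characters."""
    word = word.lower()
    upper = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, upper + 1)]


def build_index(names, **kwargs):
    """Map every edge n-gram to the set of names that start with it."""
    index = {}
    for name in names:
        for gram in edge_ngrams(name, **kwargs):
            index.setdefault(gram, set()).add(name)
    return index


index = build_index(["Acqua", "Acquario", "Aceto", "Accento"])
matches = sorted(index.get("ac", set()))
```

Because every prefix of every name is indexed as its own token, a lookup for "ac" finds all four items, whereas a plain CharField only indexes whole words.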

Django Haystack custom SearchView for pretty urls

I'm trying to set up Django Haystack to search based on some pretty URLs. Here are my urlpatterns.
urlpatterns += patterns('',
    url(r'^search/$', SearchView(),
        name='search_all',
    ),
    url(r'^search/(?P<category>\w+)/$', CategorySearchView(
            form_class=SearchForm,
        ),
        name='search_category',
    ),
)
My custom SearchView class looks like this:
class CategorySearchView(SearchView):
    def __name__(self):
        return "CategorySearchView"
    def __call__(self, request, category):
        self.category = category
        return super(CategorySearchView, self).__call__(request)
    def build_form(self, form_kwargs=None):
        data = None
        kwargs = {
            'load_all': self.load_all,
        }
        if form_kwargs:
            kwargs.update(form_kwargs)
        if len(self.request.GET):
            data = self.request.GET
        kwargs['searchqueryset'] = SearchQuerySet().models(self.category)
        return self.form_class(data, **kwargs)
I keep getting this error from the Django dev web server if I try to visit /search/Vendor/?q=Microsoft
UserWarning: The model u'Vendor' is not registered for search.
warnings.warn('The model %r is not registered for search.' % model)
And this on my page
The model being added to the query must derive from Model.
If I visit /search/?q=Microsoft, it works fine. Is there another way to accomplish this?
Thanks for any pointers
-Jay
There are a couple of things going on here. In your __call__ method you're assigning a category based on a string in the URL. In this error:
UserWarning: The model u'Vendor' is not registered for search
Note the unicode string. If you got an error like The model <class 'mymodel.Model'> is not registered for search then you'd know that you haven't properly created an index for that model. However, this is a string, not a model! The models method on the SearchQuerySet class requires a model class, not a string.
The first thing you could do is use that string to look up a model by content type. This is probably not a good idea! Even if you don't have models indexed which you'd like to keep away from prying eyes, you could at least generate some unnecessary errors.
Better to use a lookup in your view to route the query to the correct model index, using conditionals or perhaps a dictionary. In your __call__ method:
self.category = category.lower()
And if you have several models:
my_querysets = {
    'model1': SearchQuerySet().models(Model1),
    'model2': SearchQuerySet().models(Model2),
    'model3': SearchQuerySet().models(Model3),
}
# Default queryset then searches everything
kwargs['searchqueryset'] = my_querysets.get(self.category, SearchQuerySet())

How to limit file types on file uploads for ModelForms with FileFields?

My goal is to limit a FileField on a Django ModelForm to PDFs and Word Documents. The answers I have googled all deal with creating a separate file handler, but I am not sure how to do so in the context of a ModelForm. Is there a setting in settings.py I may use to limit upload file types?
Create a validation method like:
from django.core.exceptions import ValidationError
def validate_file_extension(value):
    if not value.name.endswith('.pdf'):
        raise ValidationError(u'Error message')
and include it on the FileField validators like this:
actual_file = models.FileField(upload_to='uploaded_files', validators=[validate_file_extension])
Also, instead of hard-coding which extensions your model allows, you should define a list in your settings.py and iterate over it.
Edit
To allow multiple file types:
def validate_file_extension(value):
    import os
    ext = os.path.splitext(value.name)[1]
    valid_extensions = ['.pdf', '.doc', '.docx']
    if ext not in valid_extensions:
        raise ValidationError(u'File not supported!')
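One detail the extension check above does not cover is letter case: a file named Report.PDF would be rejected. The following is a minimal stdlib-only sketch of a case-insensitive check; `ALLOWED_UPLOAD_EXTENSIONS` and `has_allowed_extension` are hypothetical names (not Django settings or APIs) standing in for a list you might keep in settings.py.

```python
import os

# Hypothetical setting; in a Django project this could live in settings.py.
ALLOWED_UPLOAD_EXTENSIONS = (".pdf", ".doc", ".docx")


def has_allowed_extension(filename, allowed=ALLOWED_UPLOAD_EXTENSIONS):
    """Return True if the file name ends in one of the allowed extensions."""
    # splitext only looks at the last suffix, so "a.pdf.exe" yields ".exe".
    ext = os.path.splitext(filename)[1].lower()
    return ext in allowed


ok = has_allowed_extension("Report.PDF")
```

A validator could call this helper with value.name and raise ValidationError when it returns False.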
Validating by file name extension is not reliable. For example, I can rename picture.jpg to picture.pdf and the validation won't raise an error.
A better approach is to check the content_type of the file.
Validation Method
def validate_file_extension(value):
    if value.file.content_type != 'application/pdf':
        raise ValidationError(u'Error message')
Usage
actual_file = models.FileField(upload_to='uploaded_files', validators=[validate_file_extension])
An easier way of doing it is as below in your Form
file = forms.FileField(widget=forms.FileInput(attrs={'accept':'application/pdf'}))
Django since 1.11 has a FileExtensionValidator for this purpose:
from django.core.validators import FileExtensionValidator
class SomeDocument(Model):
    document = models.FileField(validators=[
        FileExtensionValidator(allowed_extensions=['pdf', 'doc'])])
As @savp mentioned, you will also want to customize the widget so that users can't select inappropriate files in the first place:
class SomeDocumentForm(ModelForm):
    class Meta:
        model = SomeDocument
        widgets = {'document': FileInput(attrs={'accept': 'application/pdf,application/msword'})}
        fields = '__all__'
You may need to fiddle with accept to figure out exactly what MIME types are needed for your purposes.
As others have mentioned, none of this will prevent someone from renaming badstuff.exe to innocent.pdf and uploading it through your form—you will still need to handle the uploaded file safely. Something like the python-magic library can help you determine the actual file type once you have the contents.
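As a rough illustration of what content sniffing means, the sketch below checks the first bytes of an upload against a few well-known file signatures using only the standard library. It is a simplified stand-in for a library like python-magic, not a replacement: `FILE_SIGNATURES` and `sniff_document_type` are hypothetical names, and the ZIP signature matches any ZIP archive, not just .docx files.

```python
# Leading bytes ("magic numbers") of the formats discussed in this thread.
FILE_SIGNATURES = {
    "pdf": b"%PDF-",
    "docx": b"PK\x03\x04",  # ZIP container; .docx is one of many ZIP-based formats
    "doc": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",  # OLE2 compound file (legacy Word)
}


def sniff_document_type(header: bytes):
    """Return a format label if header starts with a known signature, else None."""
    for kind, signature in FILE_SIGNATURES.items():
        if header.startswith(signature):
            return kind
    return None


renamed_exe = sniff_document_type(b"MZ\x90\x00\x03")  # Windows executable header
```

In a validator you would read the first few bytes of the uploaded file, pass them to the helper, and remember to seek back to position 0 afterwards so the file can still be saved.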
For a more generic use, I wrote a small class ExtensionValidator that extends Django's built-in RegexValidator. It accepts single or multiple extensions, as well as an optional custom error message.
from django.core.validators import RegexValidator
class ExtensionValidator(RegexValidator):
    def __init__(self, extensions, message=None):
        # Accept a single extension passed as a plain string
        if isinstance(extensions, str):
            extensions = [extensions]
        regex = r'\.(%s)$' % '|'.join(extensions)
        if message is None:
            message = 'File type not supported. Accepted types are: %s.' % ', '.join(extensions)
        super(ExtensionValidator, self).__init__(regex, message)
    def __call__(self, value):
        # Validate the file name rather than the file object itself
        super(ExtensionValidator, self).__call__(value.name)
Now you can define a validator inline with the field, e.g.:
my_file = models.FileField('My file', validators=[ExtensionValidator(['pdf', 'doc', 'docx'])])
I use something along these lines (note, "pip install filemagic" is required for this...):
import magic
def validate_mime_type(value):
    supported_types = ['application/pdf']
    with magic.Magic(flags=magic.MAGIC_MIME_TYPE) as m:
        mime_type = m.id_buffer(value.file.read(1024))
        value.file.seek(0)
    if mime_type not in supported_types:
        raise ValidationError(u'Unsupported file type.')
You could probably also incorporate the previous examples into this - for example also check the extension/uploaded type (which might be faster as a primary check than magic.) This still isn't foolproof - but it's better, since it relies more on data in the file, rather than browser provided headers.
Note: This is a validator function that you'd want to add to the list of validators for the FileField model.
I find that the best way to check the type of a file is by checking its content type. I will also add that one of the best places to do type checking is in form validation. I would have a form and a validation as follows:
from django import forms
from django.core.exceptions import ValidationError
from django.utils.translation import gettext_lazy as _
class UploadFileForm(forms.Form):
    file = forms.FileField()
    def clean_file(self):
        data = self.cleaned_data['file']
        # check if the content type is what we expect
        content_type = data.content_type
        if content_type == 'application/pdf':
            return data
        else:
            raise ValidationError(_('Invalid content type'))
The following documentation links can be helpful:
https://docs.djangoproject.com/en/3.1/ref/files/uploads/ and https://docs.djangoproject.com/en/3.1/ref/forms/validation/
I handle this by using a clean_[your_field] method on a ModelForm. You could set a list of acceptable file extensions in settings.py to check against in your clean method, but there's nothing built into settings.py to limit upload types.
Django-Filebrowser, for example, takes the approach of creating a list of acceptable file extensions in settings.py.
Hope that helps you out.

Sphinx documentation on Django templatetags

I am using Sphinx with autodoc for documenting my Django project.
The design team wants documentation page(s) covering all the template tags that are defined in the project. Of course, I can make such a page by enumerating all the template processing functions by hand, but that is not DRY, is it? In fact, template tag processing functions are all marked with the @register.inclusion_tag decorator. So it seems possible and natural for some routine to collect them all and put them into the documentation.
The same goes for filter functions.
I've googled it and searched the Django documentation, but in vain. I can hardly believe that such natural functionality hasn't been implemented by someone.
I didn't stop at this point and implemented a Sphinx autodoc extension.
Snippet 2. Sphinx autodoc extension
"""
Extension of Sphinx autodoc for Django template tag libraries.
Usage:
    .. autotaglib:: some.module.templatetags.mod
        (options)
Most of the `module` autodoc directive flags are supported by `autotaglib`.
Andrew "Hatter" Ponomarev, 2010
"""
from sphinx.ext.autodoc import ModuleDocumenter, members_option, members_set_option, bool_option, identity
from sphinx.util.inspect import safe_getattr
from django.template import get_library, InvalidTemplateLibrary
class TaglibDocumenter(ModuleDocumenter):
    """
    Specialized Documenter subclass for Django taglibs.
    """
    objtype = 'taglib'
    directivetype = 'module'
    content_indent = u''
    option_spec = {
        'members': members_option, 'undoc-members': bool_option,
        'noindex': bool_option,
        'synopsis': identity,
        'platform': identity, 'deprecated': bool_option,
        'member-order': identity, 'exclude-members': members_set_option,
    }
    @classmethod
    def can_document_member(cls, member, membername, isattr, parent):
        # don't document submodules automatically
        return False
    def import_object(self):
        """
        Import the taglibrary.
        Returns True if successful, False if an error occurred.
        """
        # do an ordinary module import
        if not super(ModuleDocumenter, self).import_object():
            return False
        try:
            # ask Django if specified module is a template tags library
            # and - if it is so - get and save Library instance
            self.taglib = get_library(self.object.__name__)
            return True
        except InvalidTemplateLibrary as e:
            self.taglib = None
            self.directive.warn(str(e))
            return False
    def get_object_members(self, want_all):
        """
        Decide what members of current object must be autodocumented.
        Return `(members_check_module, members)` where `members` is a
        list of `(membername, member)` pairs of the members of *self.object*.
        If *want_all* is True, return all members. Else, only return those
        members given by *self.options.members* (which may also be none).
        """
        if want_all:
            return True, self.taglib.tags.items()
        else:
            memberlist = self.options.members or []
            ret = []
            for mname in memberlist:
                # look the tag up on the Library instance saved by import_object
                if mname in self.taglib.tags:
                    ret.append((mname, self.taglib.tags[mname]))
                else:
                    self.directive.warn(
                        'missing templatetag mentioned in :members: '
                        'module %s, templatetag %s' % (
                            safe_getattr(self.object, '__name__', '???'), mname))
            return False, ret
def setup(app):
    app.add_autodocumenter(TaglibDocumenter)
This extension defines a Sphinx directive, autotaglib, which behaves much like automodule but enumerates only the functions that implement tags.
Example:
.. autotaglib:: lib.templatetags.bfmarkup
    :members:
    :undoc-members:
    :noindex:
For the record, Django has an automatic documentation system (add django.contrib.admindocs to your INSTALLED_APPS).
This will give you extra views in the admin (usually at /admin/docs/) that represent your models, views (based on the URL), template tags and template filters.
More documentation for this can be found in the admindocs section.
You can take a look at that code to include it in your documentation or at the extensions for the Django documentation.
I solved the problem and would like to share my snippets - in case they will be useful to someone.
Snippet 1. Simple documenter
import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'project.settings'
from django.template import get_library
def show_help(libname):
    lib = get_library(libname)
    print(lib, ':')
    # lib.tags maps each tag name to the function that implements it
    for tag in lib.tags:
        print(tag)
        print(lib.tags[tag].__doc__)
if __name__ == '__main__':
    show_help('lib.templatetags.bfmarkup')
Before running this script you should set up the PYTHONPATH environment variable.
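The listing loop in Snippet 1 can also be separated from Django so that it is easy to test and reuse for filters. The sketch below is only an illustration under that assumption: `render_tag_docs` is a hypothetical helper that takes any mapping of names to callables (such as a library's tags or filters dictionary) and builds the listing as a string; the two sample functions stand in for real template tags.

```python
import inspect


def render_tag_docs(tags):
    """Build a plain-text listing from a {name: callable} mapping."""
    lines = []
    for name in sorted(tags):
        # getdoc strips the docstring's indentation; fall back if there is none.
        doc = inspect.getdoc(tags[name]) or "(undocumented)"
        lines.append(name)
        lines.extend("    " + line for line in doc.splitlines())
    return "\n".join(lines)


def bold(value):
    """Wrap value in bold markup."""


def plain(value):
    pass


# Stand-in for a tag library's name -> function mapping.
listing = render_tag_docs({"plain": plain, "bold": bold})
```

Returning a string instead of printing makes it straightforward to write the result to a file that Sphinx then includes.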