I have a django based app with haystack and whoosh search engine. I want to provide an accent and special character independent search so that I can find indexed data with special characters also by using words without special chars:
Indexed is:
'café'
Search term:
'cafe'
'café'
I've written a specific FoldingWhooshSearchBackend which uses a StemmingAnalyzer and a CharsetFilter(accent_map) as described in the following gist:
https://gist.github.com/gregplaysguitar/1727204
However the search still doesn't work like expected, i.e. I cannot search with 'cafe' and find 'café'. I've looked into the search index using:
from whoosh.index import open_dir
ix = open_dir('myservice/settings/whoosh_index')
searcher = ix.searcher()
for doc in searcher.documents():
    print doc
The special characters are still in the index.
Do I have to do something additional? Is it about changing the index template?
You have to write Haystack SearchIndex classes for your models. That's how you prepare model data for the search index.
Example of myapp/search_index.py:
from django.template.defaultfilters import slugify
from haystack import site
from haystack import indexes

from myapp.models import UserProfile

class UserProfileIndex(indexes.SearchIndex):
    text = indexes.CharField(document=True)

    def prepare_text(self, obj):
        data = [obj.get_full_name(), obj.user.email, obj.phone]
        original = ' '.join(data)
        slugified = slugify(original)
        return ' '.join([original, slugified])

site.register(UserProfile, UserProfileIndex)
If a user has the name café, you will find their profile with both search terms café and cafe.
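To illustrate why that works: Django's slugify folds accented characters to plain ASCII, so indexing the original text together with its slugified form makes both spellings searchable. A minimal check (slugify from django.template.defaultfilters, as used above):

from django.template.defaultfilters import slugify

print(slugify('café'))  # -> 'cafe'
# prepare_text() indexes 'café' plus 'cafe', so a query for either term matches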
I think the best approach is to let Haystack create the schema for maximum forwards compatibility, and then hack the CharsetFilter in.
This code is working for me with Haystack 2.4.0 and Whoosh 2.7.0:
from haystack.backends.whoosh_backend import WhooshEngine, WhooshSearchBackend
from whoosh.analysis import CharsetFilter, StemmingAnalyzer
from whoosh.support.charset import accent_map
from whoosh.fields import TEXT
class FoldingWhooshSearchBackend(WhooshSearchBackend):
    def build_schema(self, fields):
        schema = super(FoldingWhooshSearchBackend, self).build_schema(fields)
        for name, field in schema[1].items():
            if isinstance(field, TEXT):
                field.analyzer = StemmingAnalyzer() | CharsetFilter(accent_map)
        return schema

class FoldingWhooshEngine(WhooshEngine):
    backend = FoldingWhooshSearchBackend
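To actually use the folding backend, point Haystack at the custom engine in your settings and rebuild the index afterwards. A minimal sketch, assuming the classes above live in a hypothetical module myapp/search_backends.py (adjust the dotted path and index directory to your project):

import os

HAYSTACK_CONNECTIONS = {
    'default': {
        # hypothetical dotted path to the FoldingWhooshEngine defined above
        'ENGINE': 'myapp.search_backends.FoldingWhooshEngine',
        'PATH': os.path.join(os.path.dirname(__file__), 'whoosh_index'),
    },
}

Then run ./manage.py rebuild_index so existing documents are re-analyzed with the CharsetFilter.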
Related
We are using axios to send a GET request to our Django instance, which splits it into search terms and runs the search. This has been working fine until we ran into an edge case. We use urlencode to ensure that strings do not contain empty spaces or other special characters.
To generalize the issue: we have a TextField called "name" and we want to search for the term "A & B Company". The problem appears when the request reaches Django.
What we expected was that name=A%20&%20B%20Company&field=value would be parsed as name='A & B Company' and field='value'.
Instead, it is parsed as name='A ' 'B Company' and field='value'. The & symbol is incorrectly treated as separator, despite being encoded.
Is there a way to indicate django GET parameter that certain & symbols are part of the value, instead of separators for fields?
You can use the urllib library:
class ModelExample(models.Model):
    name = models.TextField()

# in view...
from urllib.parse import parse_qs

instance = ModelExample(name="name=A%20&%20B%20Company&field=value")
dict_qs = parse_qs(instance.name)
dict_qs now contains a dict with the decoded query string.
You can find more information about urllib.parse here: https://docs.python.org/3/library/urllib.parse.html
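A minimal sketch of what parse_qs returns for the two encodings (assuming Python 3; the ampersand only survives as part of the value if the client percent-encodes it as %26):

from urllib.parse import parse_qs

# '&' inside the value encoded as %26
print(parse_qs("name=A%20%26%20B%20Company&field=value"))
# {'name': ['A & B Company'], 'field': ['value']}

# bare '&' still acts as a parameter separator
print(parse_qs("name=A%20&%20B%20Company&field=value"))
# {'name': ['A '], 'field': ['value']}  (the ' B Company' fragment is dropped)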
I am new to python as well as scrapy.
I am trying to crawl a seed URL https://www.health.com/patients/status/. This seed URL contains many URLs, but I want to fetch only the URLs that contain Faci/Details/#somenumber from the seed URL. The URLs look like this:
https://www.health.com/patients/status/ ->https://www.health.com/Faci/Details/2
-> https://www.health.com/Faci/Details/3
-> https://www.health.com/Faci/Details/4
https://www.health.com/Faci/Details/2 -> https://www.health.com/provi/details/64
-> https://www.health.com/provi/details/65
https://www.health.com/Faci/Details/3 -> https://www.health.com/provi/details/70
-> https://www.health.com/provi/details/71
Inside each https://www.health.com/Faci/Details/2 page there are links like https://www.health.com/provi/details/64 and https://www.health.com/provi/details/65. Finally, I want to fetch some data from the https://www.health.com/provi/details/#somenumber URLs. How can I achieve this?
So far I have tried the code below from the Scrapy tutorial and am able to crawl only the URLs that contain https://www.health.com/Faci/Details/#somenumber. It does not go on to https://www.health.com/provi/details/#somenumber. I tried setting a depth limit in the settings.py file, but that didn't work.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from news.items import NewsItem

class MySpider(CrawlSpider):
    name = 'provdetails.com'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    rules = (
        Rule(LinkExtractor(allow=('/Faci/Details/\d+', )), follow=True),
        Rule(LinkExtractor(allow=('/provi/details/\d+', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = NewsItem()
        item['id'] = response.xpath("//title/text()").extract()
        item['name'] = response.xpath("//title/text()").extract()
        item['description'] = response.css('p.introduction::text').extract()
        filename = 'details.txt'
        with open(filename, 'wb') as f:
            f.write(item)
        self.log('Saved file %s' % filename)
        return item
Please help me proceed further.
To be honest, the regex-based and mighty Rule/LinkExtractor has often given me a hard time. For a simple project, one approach is to extract all links on the page and then look at the href attribute. If the href matches your needs, yield a new Request object for it. For instance:
from scrapy.http import Request
from scrapy.selector import Selector
...
# follow links
for href in sel.xpath('//div[@class="contentLeft"]//div[@class="pageNavigation nobr"]//a').extract():
    linktext = Selector(text=href).xpath('//a/text()').extract_first()
    if linktext and linktext.strip() == "Weiter":
        link = Selector(text=href).xpath('//a/@href').extract()[0]
        url = response.urljoin(link)
        print url
        yield Request(url, callback=self.parse)
Some remarks on your code:
response.xpath(...).extract()
This returns a list; you may want to have a look at extract_first(), which provides the first item (or None).
with open(filename, 'wb') as f:
This will overwrite the file several times, so only the last item will end up saved. Also, you open the file in binary mode ('b'); from the filename I guess you want to read it as text. Use 'a' to append? See the open() docs.
An alternative is to use the -o flag and let Scrapy's built-in facilities store the items as JSON or CSV, for example:
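For example, to write all yielded items to a JSON file using the spider name from the question:

scrapy crawl provdetails.com -o details.json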
return item
It is good style to yield items instead of returning them; at the latest when you need to create several items from one page, you have to yield them.
Another good approach is to use one parse() function per type of page.
For instance, every page in start_urls will end up in parse(). From there you could extract the links and yield Requests for each /Faci/Details/N page with a callback of parse_faci_details(). In parse_faci_details() you again extract the links of interest, create Requests and pass them via callback= to e.g. parse_provi_details().
In that function you create the items you need, as sketched below.
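A minimal sketch of that structure, assuming the same start URL as the question; the XPaths and item fields are placeholders, and plain dicts are yielded instead of the NewsItem class for brevity:

import scrapy

class ProviSpider(scrapy.Spider):
    name = 'provdetails'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    def parse(self, response):
        # seed page: follow only the /Faci/Details/N links
        for href in response.xpath('//a/@href').extract():
            if '/Faci/Details/' in href:
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_faci_details)

    def parse_faci_details(self, response):
        # facility page: follow the /provi/details/N links
        for href in response.xpath('//a/@href').extract():
            if '/provi/details/' in href:
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_provi_details)

    def parse_provi_details(self, response):
        # provider page: build one item per page
        yield {
            'name': response.xpath('//title/text()').extract_first(),
            'description': response.css('p.introduction::text').extract(),
        }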
I am developing a web application for managing customers. So I have a Customer entity which is made up by usual fields such as first_name, last_name, age etc.
I have a page where these customers are shown as a table. In the same page I have a search field, and I'd like to filter customers and update the table while the user is typing something in the search field, using Ajax.
Here is how it should work:
Figure 1: The main page showing all of the customers:
Figure 2: As long as the user types letter "b", the table is updated with the results:
Given that partial text matching is not supported in GAE, I have worked around it based on what is shown here. TL;DR: I have created a Customers Index that contains a Search Document for every customer (doc_id=customer_key). Each Search Document contains Atom Fields for every customer field I want to be able to search on (e.g. first_name, last_name). Every field is built like this: if the last_name is Berlusconi, the field is made up of the Atom Fields "b" "be" "ber" "berl" "berlu" "berlus" "berlusc" "berlusco" "berluscon" "berlusconi".
In this way I am able to perform full text matching in a way that resembles partial text matching. If I search for "Be", the Berlusconi customer is returned.
The search is made by Ajax calls: whenever a user types in the search field (the call is delayed a little to see if the user keeps typing, to avoid sending a burst of requests), an Ajax call is made with the query string, and a JSON object is returned.
Now, things were working well while debugging, but I was testing with only a few people in the datastore. As soon as I put in many people, search becomes very slow.
This is how I create search documents; it is called every time a new customer is put into the datastore.
def put_search_document(cls, key):
    """
    Called by _post_put_hook in BaseModel
    """
    model = key.get()
    _fields = []
    if model:
        _fields.append(search.AtomField(name="empty", value=""),)  # to retrieve customers when no query string
        _fields.append(search.TextField(name="sort1", value=model.last_name.lower()))
        _fields.append(search.TextField(name="sort2", value=model.first_name.lower()))
        _fields.append(search.TextField(name="full_name", value=Customer.tokenize1(
            model.first_name.lower()+" "+model.last_name.lower()
        )),)
        _fields.append(search.TextField(name="full_name_rev", value=Customer.tokenize1(
            model.last_name.lower()+" "+model.first_name.lower()
        )),)
        # _fields.append(search.TextField(name="telephone", value=Customer.tokenize1(
        #     model.telephone.lower()
        # )),)
        # _fields.append(search.TextField(name="email", value=Customer.tokenize1(
        #     model.email.lower()
        # )),)
        document = search.Document(  # create new document with doc_id=key.urlsafe()
            doc_id=key.urlsafe(),
            fields=_fields)
        index = search.Index(name=cls._get_kind()+"Index")  # not in try-except: defer will catch and retry.
        index.put(document)
@staticmethod
def tokenize1(string):
    s = ""
    for i in range(len(string)):
        if i > 0:
            s = s + " " + string[0:i+1]
        else:
            s = string[0:i+1]
    return s
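For reference, the helper just returns all prefixes of the input, space-separated, which is what makes the prefix-based matching described above work:

print(Customer.tokenize1("berlusconi"))
# b be ber berl berlu berlus berlusc berlusco berluscon berlusconi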
This is the search code:
@staticmethod
def search(ndb_model, query_phrase):
    # TODO: search returns a limited number of results (20 by default)
    # (See Search Results at https://cloud.google.com/appengine/docs/python/search/#Python_Overview)
    sort1 = search.SortExpression(expression='sort1', direction=search.SortExpression.ASCENDING,
                                  default_value="")
    sort2 = search.SortExpression(expression='sort2', direction=search.SortExpression.ASCENDING,
                                  default_value="")
    sort_opt = search.SortOptions(expressions=[sort1, sort2])
    results = search.Index(name=ndb_model._get_kind() + "Index").search(
        search.Query(
            query_string=query_phrase,
            options=search.QueryOptions(
                sort_options=sort_opt
            )
        )
    )
    print "----------------"
    res_list = []
    for r in results:
        obj = ndb.Key(urlsafe=r.doc_id).get()
        print obj.first_name + " " + obj.last_name
        res_list.append(obj)
    return res_list
Has anyone else had the same experience? If so, how did you solve it?
Thank you guys very much,
Marco Galassi
EDIT: names, email, phone are obviously totally invented.
Edit 2: I have now moved to TextField, which looks a little faster, but the problem still persists.
I've got a basic Django/Haystack/Elasticsearch installation running that seems to be working, until I hit an autocomplete problem:
It doesn't return autocompletions, just the full field. Another problem is with data that has caps, which isn't normalized (such as usernames).
My installation:
django 1.6.4
haystack 2.1.0
elasticsearch 1.3.1
py-elasticsearch 0.6.1
class SocialProfileIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    username = indexes.CharField(model_attr='username')
    first_name = indexes.CharField(model_attr='first_name')
    last_name = indexes.CharField(model_attr='last_name')

    # Auto-complete
    username_auto = indexes.EdgeNgramField(model_attr='username')
    first_name_auto = indexes.EdgeNgramField(model_attr='first_name')
    last_name_auto = indexes.EdgeNgramField(model_attr='last_name')

    def get_model(self):
        return SocialProfile

    def index_queryset(self, using=None):
        return self.get_model().objects.all()
In the view I return:
results = SearchQuerySet().models(SocialProfile).autocomplete(username_auto=q)
so when indexing a SocialProfile:
username=alonisser
when q (the query) is 'alonisser' I get the correct reply, but when I try 'alon' or similar I don't get any results.
When I access elasticsearch directly through py-elasticsearch (without haystack):
es = Elasticsearch('http://elasticsearch.url:9200')
es.search('username_auto:alon', index='haystack')
I do get the correct result, so the data is stored there and the problem is probably that I'm doing something wrong with Haystack.
A similar but different problem is when the indexed item has caps, like 'Alonisser': searching for 'alonisser' doesn't return any results, but searching for 'Alonisser' does.
What am I doing wrong? Thanks for the help..
I think you already got an answer in the Haystack forums, but just to bring it out here as well:
One way to get rid of the caps problem is to use a custom prepare method in your index class, although my Haystack somehow handles it by default :S.
def prepare_username_auto(self, obj):
    return obj.username.lower()
This will convert all usernames to lowercase when you run 'update_index'. Then you can also lowercase the user's search term, which should produce correct results.
To search for a part of the word you need to use:
results = SearchQuerySet().models(SocialProfile).autocomplete(username_auto__startswith=q)
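Putting both suggestions together in the view might look like this (a sketch; q is the raw query string from the request):

from haystack.query import SearchQuerySet

q = request.GET.get('q', '')
# lowercase the user's input to match the lowercased index,
# and use __startswith for partial matches
results = SearchQuerySet().models(SocialProfile).autocomplete(
    username_auto__startswith=q.lower()
)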
Say I have a model Person for which there is a PersonIndex class in search_indexes.py that makes all fields of it searchable. How can I make a search within only those entries where say the has_title field is True?
I tried the following, but it just searches among all entries, not just the ones where has_title is True:
srch = request.GET.get('search', "")
sqs = SearchQuerySet().filter(has_title=True)
clean_query = sqs.query.clean(srch)
results = sqs.raw_search(clean_query)
I am using Whoosh 2.4.1, Django-haystack 1.2.7 and Django 1.4.
Use filter(content=clean_query) instead of raw_search(clean_query). See here for more details.
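Applied to the snippet in the question, a minimal sketch that keeps the has_title filter and chains a content filter instead of raw_search:

from haystack.query import SearchQuerySet

srch = request.GET.get('search', "")
sqs = SearchQuerySet().filter(has_title=True)
clean_query = sqs.query.clean(srch)
results = sqs.filter(content=clean_query)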