django-haystack autocomplete returns too wide results - django

I have created an Index with field title_auto:
class GameIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, model_attr='title')
    title = indexes.CharField(model_attr='title')
    title_auto = indexes.NgramField(model_attr='title')
Elastic search settings look like this:
ELASTICSEARCH_INDEX_SETTINGS = {
'settings': {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "lowercase",
"filter": ["haystack_ngram"],
"token_chars": ["letter", "digit"]
},
"edgengram_analyzer": {
"type": "custom",
"tokenizer": "lowercase",
"filter": ["haystack_edgengram"]
}
},
"tokenizer": {
"haystack_ngram_tokenizer": {
"type": "nGram",
"min_gram": 1,
"max_gram": 15,
},
"haystack_edgengram_tokenizer": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 15,
"side": "front"
}
},
"filter": {
"haystack_ngram": {
"type": "nGram",
"min_gram": 1,
"max_gram": 15
},
"haystack_edgengram": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 15
}
}
}
}
}
When I run an autocomplete search it works, but it returns too many irrelevant results:
qs = SearchQuerySet().models(Game).autocomplete(title_auto=search_phrase)
OR
qs = SearchQuerySet().models(Game).filter(title_auto=search_phrase)
Both of them produce the same output.
If search_phrase is "monopoly", the first results contain "Monopoly" in their titles. However, although there are only 2 relevant items, the query returns 51. The others have nothing to do with "Monopoly" at all.
So my question is - how can I change relevance of the results?

It's hard to tell for sure since I haven't seen your full mapping, but I suspect the problem is that the analyzer (one of them) is being used for both indexing and searching. So when you index a document, lots of ngram terms get created and indexed. If you search and your search text is also analyzed the same way, lots of search terms get generated. Since your smallest ngram is a single letter, pretty much any query is going to match a lot of documents.
We wrote a blog post about using ngrams for autocomplete that you might find helpful, here: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. But I'll give you a simpler example to illustrate what I mean. I'm not super familiar with haystack so I probably can't help you there, but I can explain the issue with ngrams in Elasticsearch.
First I'll set up an index that uses an ngram analyzer for both indexing and searching:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 1,
"max_gram": 15,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"analyzer": "nGram_analyzer"
}
}
}
}
}
and add some docs:
PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"title":"monopoly"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"title":"oligopoly"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"title":"plutocracy"}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"title":"theocracy"}
{"index":{"_index":"test_index","_type":"doc","_id":5}}
{"title":"democracy"}
and run a simple match search for "poly":
POST /test_index/_search
{
"query": {
"match": {
"title": "poly"
}
}
}
it returns all five documents:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 4.729521,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 4.729521,
"_source": {
"title": "oligopoly"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 4.3608603,
"_source": {
"title": "monopoly"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1.0197333,
"_source": {
"title": "plutocracy"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "4",
"_score": 0.31496215,
"_source": {
"title": "theocracy"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "5",
"_score": 0.31496215,
"_source": {
"title": "democracy"
}
}
]
}
}
This is because the search term "poly" gets tokenized into ngrams that include the single-letter terms "p", "o", "l", and "y". Since the "title" field in each of the documents was also tokenized into single-letter terms, the query matches every document.
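To make the effect concrete, here is a minimal Python sketch that imitates the ngram filter above (min_gram 1, max_gram 15). It only approximates what Elasticsearch does, and the `ngrams` helper is made up for illustration; it is not part of any library. The first list shows the matches when the query is ngram-analyzed like the documents, the second when the query stays a single term.

```python
def ngrams(term, min_gram=1, max_gram=15):
    """Return every substring of `term` whose length is between min_gram and max_gram."""
    term = term.lower()
    grams = set()
    for start in range(len(term)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(term):
                grams.add(term[start:start + size])
    return grams

titles = ["monopoly", "oligopoly", "plutocracy", "theocracy", "democracy"]
index = {title: ngrams(title) for title in titles}

# Query analyzed with the same ngram analyzer: any shared gram counts as a match.
query_grams = ngrams("poly")
matches_same_analyzer = [t for t in titles if index[t] & query_grams]

# Query kept as the single term "poly" (roughly what a standard search analyzer yields).
matches_standard = [t for t in titles if "poly" in index[t]]

print(matches_same_analyzer)
print(matches_standard)
```

With the shared analyzer, a single common letter is enough to match, which is why unrelated titles show up.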
If we rebuild the index with this mapping instead (same analyzer and docs; note that in Elasticsearch 2.0 and later the "index_analyzer" setting is called "analyzer"):
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "standard"
}
}
}
}
the query will return what we expect:
POST /test_index/_search
{
"query": {
"match": {
"title": "poly"
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.5108256,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1.5108256,
"_source": {
"title": "monopoly"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 1.5108256,
"_source": {
"title": "oligopoly"
}
}
]
}
}
Edge ngrams work similarly, except that only terms that start at the beginning of the words will be used.
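As a rough illustration of the difference, the following Python sketch lists the edge ngrams of a word. The `edge_ngrams` helper is hypothetical and only mimics the front-anchored behaviour described here; it is not the Elasticsearch implementation.

```python
def edge_ngrams(term, min_gram=1, max_gram=15):
    """Return the prefixes of `term` with lengths from min_gram to max_gram."""
    term = term.lower()
    longest = min(max_gram, len(term))
    return [term[:size] for size in range(min_gram, longest + 1)]

grams = edge_ngrams("monopoly")
print(grams)

# A prefix such as "mono" is produced, but an inner fragment such as "poly" is not.
print("mono" in grams, "poly" in grams)
```

So with edge ngrams a search for "mono" can find "monopoly", while a search for "poly" cannot.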
Here is the code I used for this example:
http://sense.qbox.io/gist/b24cbc531b483650c085a42963a49d6a23fa5579

Unfortunately at this point in time there seems to be no way (apart from implementing a custom backend) to configure search analyzers and index analyzers through Django-Haystack separately.
If Django-Haystack autocomplete returns results that are too broad, you can use the score value provided with each search result to trim the output.
if search_query != "":
    # Use autocomplete query or filter,
    # with results_filtered being a SearchQuerySet()
    results_filtered = results_filtered.filter(text=search_query)
    # Remove objects with a low score
    for result in results_filtered:
        if result.score < SEARCH_SCORE_THRESHOLD:
            results_filtered = results_filtered.exclude(id=result.id)
It worked reasonably well for me without having to define my own backend and schema building.
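The same idea can be sketched without issuing an extra exclude() per result, by filtering the evaluated results in plain Python. This is only a sketch under assumptions: `FakeResult` stands in for haystack's search result objects (only `pk` and `score` are modelled), and the threshold value is invented and would need tuning against real data.

```python
from collections import namedtuple

# Stand-in for a haystack search result; only the attributes used here are modelled.
FakeResult = namedtuple("FakeResult", ["pk", "score"])

SEARCH_SCORE_THRESHOLD = 1.0  # assumed value; tune it against your own data

def drop_low_scores(results, threshold=SEARCH_SCORE_THRESHOLD):
    """Keep only results whose score reaches the threshold."""
    return [result for result in results if result.score >= threshold]

results = [FakeResult(1, 4.36), FakeResult(2, 1.51), FakeResult(3, 0.31)]
kept = drop_low_scores(results)
print([r.pk for r in kept])
```

The trade-off is that the result is a list rather than a SearchQuerySet, so it cannot be chained further.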

Related

How to get document from elastic search with partial query string?

I have three documents indexed with title "manage", "manager", and "management".
I am searching by following query:
{
"query": {
"query_string": {
"query": "manage*",
"fields": ["title"]
}
}
}
I am getting the same score for all three documents. I want the document with "title": "manage" first, then manager and management.
There are two ways to achieve what you want. The easiest one to try out is to resort to script-based sorting and return a score that matches the length of the data:
GET test/_search
{
"sort": {
"_script": {
"type": "number",
"script": {
"lang": "painless",
"source": "doc['title.keyword'].value.length()"
},
"order": "asc"
}
},
"query": {
"query_string": {
"query": "manage*",
"fields": [
"title"
]
}
}
}
Note: if you don't have the title.keyword field, you can change your script to work from the source directly:
params._source['title'].length()
You'll get manage (with a sort value of 6), then manager (with a sort value of 7) and then management (with a sort value of 10).
The other way to achieve this is to actually index another integer field (e.g. titleLength) with the actual length of the title field and sort by titleLength.
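A minimal Python sketch of that second approach, assuming the length is computed in application code before indexing. The `with_title_length` helper is made up for illustration, and the final sort only imitates locally what sorting on titleLength in ascending order would return.

```python
def with_title_length(doc):
    """Return a copy of the document with an extra titleLength field."""
    enriched = dict(doc)
    enriched["titleLength"] = len(doc["title"])
    return enriched

docs = [{"title": "management"}, {"title": "manage"}, {"title": "manager"}]
indexed = [with_title_length(d) for d in docs]

# Sorting by titleLength ascending mimics a sort on that field in the search request.
ordered = sorted(indexed, key=lambda d: d["titleLength"])
print([(d["title"], d["titleLength"]) for d in ordered])
```

Compared with script-based sorting, this moves the work to index time, at the cost of reindexing existing documents.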
The search query below matches all the documents containing manage, but because a boost is applied to manage, the document containing exactly manage gets a higher score than the other documents.
To learn more, refer to the Query String Query documentation.
Index Data
{ "name":"manage" }
{ "name":"manager"}
{ "name":"management"}
Search Query
{
"query": {
"query_string": {
"fields": [
"name"
],
"query": "manage^2*"
}
}
}
Search Result:
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 3.3263016,
"_source": {
"name": "manage"
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"name": "manager"
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "3",
"_score": 1.0,
"_source": {
"name": "management"
}
}
]
Edit 1:
If 1 more document is indexed:
{ "name":"managers" }
Search Query:
{
"query": {
"query_string": {
"query": "manage~"
}
}
}
Search Result:
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 0.87546873,
"_source": {
"name": "manage"
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "2",
"_score": 0.7295572, -->score is different
"_source": {
"name": "manager"
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "4",
"_score": 0.58364576,
"_source": {
"name": "managers"
}
}
]
In your case, management is more than 2 edits away from manage, i.e. manage -> managem -> manageme -> managemen -> management.
When the search uses a fuzzy query, a maximum of two edits is allowed.
Therefore management will not match the search query above, while all the other words (which have an edit distance <= 2) will match, with different scores.
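The edit distances can be checked with a short Python sketch. It computes the plain Levenshtein distance, which is an approximation of the fuzzy matching involved (no transpositions are needed for these words), and the `edit_distance` helper is written here for illustration only.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

for word in ["manage", "manager", "managers", "management"]:
    distance = edit_distance("manage", word)
    print(word, distance, "match" if distance <= 2 else "no match")
```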

Elastic Search Sort

I have a table for some activities like
[
{
"id": 123,
"name": "Ram",
"status": 1,
"activity": "Poster Design"
},
{
"id": 123,
"name": "Ram",
"status": 1,
"activity": "Poster Design"
},
{
"id": 124,
"name": "Leo",
"categories": [
"A",
"B",
"C"
],
"status": 1,
"activity": "Brochure"
},
{
"id": 134,
"name": "Levin",
"categories": [
"A",
"B",
"C"
],
"status": 1,
"activity": "3D Printing"
}
]
I want to get this data from elastic search 5.5 by sorting on field activity, but I need all the data corresponding to name = "Ram" first and then remaining in a single query.
You can use a function score query to boost the results that match a filter (in this case, ram in name).
The following query should work for you:
POST sort_index/_search
{
"query": {
"function_score": {
"query": {
"match_all": {}
},
"boost": "5",
"functions": [{
"filter": {
"match": {
"name": "ram"
}
},
"random_score": {},
"weight": 1000
}],
"score_mode": "max"
}
},
"sort": [{
"activity.keyword": {
"order": "desc"
}
}]
}
I would suggest using a bool query combined with a should clause.
You will also need to use a sort clause on your field.
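One way this suggestion might look is sketched below as a Python function that builds the request body. It is an untested sketch, not a verified answer for a 5.5 cluster: the `ram_first_query` name and the boost value are invented, and `activity.keyword` is taken from the other answer and assumes that sub-field exists in the mapping.

```python
import json

def ram_first_query(name="ram", boost=10.0):
    """Build a search body: boost documents whose name matches, then sort."""
    return {
        "query": {
            "bool": {
                # match_all keeps the non-matching documents in the result set
                "must": {"match_all": {}},
                "should": [
                    {"constant_score": {
                        "filter": {"match": {"name": name}},
                        "boost": boost,
                    }}
                ],
            }
        },
        # score first (boosted documents on top), then activity within each group
        "sort": [
            {"_score": {"order": "desc"}},
            {"activity.keyword": {"order": "asc"}},
        ],
    }

body = ram_first_query()
print(json.dumps(body, indent=2))
```

The constant_score wrapper is used so that every matching document gets the same boost, which lets the secondary sort on activity decide the order inside each group.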

ElasticSearch regex query doesn't work

I am using ES 2.4.6 with Java 8, and I created a document object as follows:
@Document(indexName = "airports", type = "airport")
public class Airport {
    @Id
    private String id;
    @Field(type = FieldType.String)
    private String name;
}
I successfully saved several airport objects to ES, with the following
names: "San Francisco", "San Mateo", "Santiago", "Palo Alto", "Big San"
The JSON content inside ES looks like following:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 1,
"hits": [
{
"_index": "airports",
"_type": "airport",
"_id": "SSMlsTWIYefbXHCnYEwEY",
"_score": 1,
"_source": {
"id": "SSMlsTWIYefbXHCnYEwEY",
"name": "Santiago"
}
},
{
"_index": "airports",
"_type": "airport",
"_id": "LlDcKuywPjURNeIISjXLjC",
"_score": 1,
"_source": {
"id": "LlDcKuywPjURNeIISjXLjC",
"name": "San Mateo"
}
},
{
"_index": "airports",
"_type": "airport",
"_id": "CVIjEHYphSmZIjYbHCMwtkqfKWtEHVh",
"_score": 1,
"_source": {
"id": "CVIjEHYphSmZIjYbHCMwtkqfKWtEHVh",
"name": "San Francisco"
}
},
{
"_index": "airports",
"_type": "airport",
"_id": "gbntKR",
"_score": 1,
"_source": {
"id": "gbntKR",
"name": "Palo Alto"
}
},
{
"_index": "airports",
"_type": "airport",
"_id": "bKosUdHeseMMboyaejv",
"_score": 1,
"_source": {
"id": "bKosUdHeseMMboyaejv",
"name": "Big San"
}
}
]
}
}
Then I have the following curl command, which uses a regex query to find all airport
names starting with "san", ignoring case:
curl -XGET 'localhost:9200/airports/airport/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": {
"regexp":{
"name": "^(?i)san"
}
}
}
'
When I match the regex "^(?i)san" directly against those airport names,
it works as expected:
String regex = "^(?i)san";
assertTrue("San Francisco".matches(regex));
assertTrue("San Mateo".matches(regex));
assertTrue("Santiago".matches(regex));
assertTrue(!"Big San".matches(regex));
So does anyone know why the ES regex query returns an empty result? If
I use "san" as the regex, all 4 names are returned, and if I use "San", nothing is returned.
You can use Match Phrase Prefix for the problem mentioned above.
{
"query": {
"match_phrase_prefix": {
"name": "San"
}
}
}
See if it resolves your problem.
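For background on why the original query finds nothing: a regexp query is matched against the individual terms stored in the index, the pattern has to match a whole term, and Lucene's regular expression syntax does not support `^` or `(?i)`. On an analyzed string field the terms are also lowercased. The Python sketch below is a simplified emulation of that behaviour, not Elasticsearch itself; `analyze` and `regexp_query` are hypothetical helpers, and Python's `re` syntax is only used for the trivial `.*` pattern.

```python
import re

names = ["San Francisco", "San Mateo", "Santiago", "Palo Alto", "Big San"]

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def regexp_query(pattern, documents):
    """Return documents with at least one indexed term fully matched by the pattern."""
    compiled = re.compile(pattern)
    return [doc for doc in documents
            if any(compiled.fullmatch(term) for term in analyze(doc))]

print(regexp_query("san.*", names))  # lowercase pattern matches lowercased terms
print(regexp_query("San.*", names))  # uppercase "S" never matches a lowercased term
```

Note that "Big San" also matches, because the pattern is tested per term rather than against the start of the whole name.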

Elasticsearch - Tokenizer configuration

Does anyone have an idea of which tokenizer to use, and how to configure it, for the case below?
Input : ["test1-data.example.com", "test2-new.example.com", "new1-test.example.com"]
Output (expected) :
test1-data.example.com test2-new.example.com new1-test.example.com
It's not obvious whether it solves your problem or not, but here's one way you can do what it sounds like you're asking:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"_all": {
"enabled": true,
"store": true,
"index": "not_analyzed"
},
"properties": {
"text_field": {
"type": "string",
"include_in_all": true
}
}
}
}
}
PUT /test_index/doc/1
{
"text_field": ["test1-data.example.com", "test2-new.example.com", "new1-test.example.com"]
}
POST /test_index/_search
{
"fields": [
"_all"
]
}
...
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"fields": {
"_all": "test1-data.example.com test2-new.example.com new1-test.example.com "
}
}
]
}
}
Here's the code in Sense:
http://sense.qbox.io/gist/45200711a41268634439b669e18541e68042ac8a

CouchDB - Why is my rereduce always coming as false? I am not able to reduce anything properly

I am new to CouchDB. I have a 9 GB dataset loaded into my CouchDB. I am able to map everything correctly, but I cannot reduce any of the results using the code written in the reduce column. When I tried logging, the log shows the rereduce value as false. Do I need to do anything special in the map function, or how do I set the rereduce value to true?
A sample of my data is as follows:
{
"_id": "33d4d945613344f13a3ee92933b160bf",
"_rev": "1-0425ca93e3aa939dff46dd51c3ab86f2",
"release": {
"genres": {
"genre": "Electronic"
},
"status": "Accepted",
"videos": {
"video": [
{
"title": "[1995] bola - krak jakomo",
"duration": 349,
"description": "[1995] bola - krak jakomo",
"src": "http://www.youtube.com/watch?v=KrELXoYThpI",
"embed": true
},
{
"title": "Bola - Forcasa 3",
"duration": 325,
"description": "Bola - Forcasa 3",
"src": "http://www.youtube.com/watch?v=Lz9itUo5xtc",
"embed": true
},
{
"title": "Bola (Darrell Fitton) - Metalurg (MV)",
"duration": 439,
"description": "Bola (Darrell Fitton) - Metalurg (MV)",
"src": "http://www.youtube.com/watch?v=_MYpOOMRAeQ",
"embed": true
}
]
},
"labels": {
"label": {
"catno": "SKA005",
"name": "Skam"
}
},
"companies": "",
"styles": {
"style": [
"Downtempo",
"Experimental",
"Ambient"
]
},
"formats": {
"format": {
"text": "",
"name": "Vinyl",
"qty": 1,
"descriptions": {
"description": [
"12\"",
"Limited Edition",
"33 ⅓ RPM"
]
}
}
},
"country": "UK",
"id": 1928,
"released": "1995-00-00",
"artists": {
"artist": {
"id": 390,
"anv": "",
"name": "Bola",
"role": "",
"tracks": "",
"join": ""
}
},
"title": 1,
"master_id": 13562,
"tracklist": {
"track": [
{
"position": "A1",
"duration": "4:33",
"title": "Forcasa 3"
},
{
"position": "A2",
"duration": "5:48",
"title": "Krak Jakomo"
},
{
"position": "B1",
"duration": "7:50",
"title": "Metalurg 2"
},
{
"position": "B2",
"duration": "6:40",
"title": "Balloom"
}
]
},
"data_quality": "Correct",
"extraartists": {
"artist": {
"id": 388200,
"anv": "",
"name": "Paul Solomons",
"role": "Mastered By",
"tracks": "",
"join": ""
}
},
"notes": "Limited to 480 copies.\nA1 is a shorter version than that found on the 'Soup' LP.\nA2 ends in a lock groove."
}
}
My intention is to count the mapped values. My mapping function is as follows:
function(doc){
if(doc.release)
emit(doc.release.title,1)
}
The map results show around 5800 rows.
I want to use the following functions in the reduce tab to count:
Reduce:
_count or _sum
It does not give a single total value. I cannot even get a simple _count operation right.
Please help me!
What you got was the sum of values per title. What you wanted was the sum of values overall.
Change the grouping drop-down list to none.
Check CouchDB's wiki for more details on grouping.
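The difference between the two grouping settings can be emulated in a few lines of Python. This is only a local sketch of the behaviour, not CouchDB code: the rows and titles are invented sample data, and `reduce_sum` is a hypothetical helper standing in for the built-in _sum reduce.

```python
from collections import Counter

# (key, value) rows as the map function would emit them: emit(doc.release.title, 1)
rows = [("title-a", 1), ("title-a", 1), ("title-b", 1), ("title-c", 1)]

def reduce_sum(rows, group):
    """Emulate _sum: one total per key when grouping, one overall total otherwise."""
    if group:
        totals = Counter()
        for key, value in rows:
            totals[key] += value
        return dict(totals)
    return sum(value for _, value in rows)

print(reduce_sum(rows, group=True))   # one row per title
print(reduce_sum(rows, group=False))  # a single overall count
```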