I'm learning ElasticSearch in the hopes of dumping my business data into ES and viewing it with Kibana. After a week of various issues I finally have ES and Kibana working (1.7.0 and 4 respectively) on 2 Ubuntu 14.04 desktop machines (clustered).
The issue I'm having now is how best to get the data into ES. The data flow is that I capture the PHP global variables $_REQUEST and $_SERVER for each visit to a text file named with a unique ID. From there, if they fill in a form I capture that data in a text file also named with that unique ID, in a different directory. Then my customers tell me whether that form fill was any good, with a delay of up to 50 days.
So I'm starting with the visitor data - $_REQUEST and $_SERVER. A lot of it is redundant so I'm really just attempting to capture the timestamp of their arrival, their IP, the IP of the server they visited, the domain they visited, the unique ID, and their User Agent. So I created this mapping:
time_date_mapping = { 'type': 'date_time' }
str_not_analyzed = { 'type': 'string' }  # Originally this included 'index': 'not_analyzed' as well

visit_mapping = {
    'properties': {
        'uniqID': str_not_analyzed,
        'pages': str_not_analyzed,
        'domain': str_not_analyzed,
        'Srvr IP': str_not_analyzed,
        'Visitor IP': str_not_analyzed,
        'Agent': { 'type': 'string' },
        'Referrer': { 'type': 'string' },
        'Entrance Time': time_date_mapping,  # Stored as a Unix timestamp
        'Request Time': time_date_mapping,   # Stored as a Unix timestamp
        'Raw': { 'type': 'string', 'index': 'not_analyzed' },
    },
}
I then enter it into ES with:
es.index(
    index=Visit_to_ElasticSearch.INDEX,
    doc_type=Visit_to_ElasticSearch.DOC_TYPE,
    id=self.uniqID,
    timestamp=int(math.floor(self._visit['Entrance Time'])),
    body=visit
)
When I look at the data in the index on ES only Entrance Time, _id, _type, domain, and uniqID are indexed for searching (according to Kibana). All of the data is present in the document but most of the fields show "Unindexed fields can not be searched."
Additionally, I was attempting to get a simple pie chart of the Agents. But I couldn't figure out how to get it visualized, because no matter what boxes I click on, the Agent field is never an option for aggregation. I mention it because it seems only the fields which are indexed show up.
I've attempted to mimic the mapping examples in elasticsearch.py, such as the example which pulls in GitHub data. Can someone correct me on how I'm using that mapping?
Thanks
------------ Mapping -------------
{
    "visits": {
        "mappings": {
            "visit": {
                "properties": {
                    "Agent": {
                        "type": "string"
                    },
                    "Entrance Time": {
                        "type": "date",
                        "format": "dateOptionalTime"
                    },
                    "Raw": {
                        "properties": {
                            "Entrance Time": {
                                "type": "double"
                            },
                            "domain": {
                                "type": "string"
                            },
                            "uniqID": {
                                "type": "string"
                            }
                        }
                    },
                    "Referrer": {
                        "type": "string"
                    },
                    "Request Time": {
                        "type": "string"
                    },
                    "Srvr IP": {
                        "type": "string"
                    },
                    "Visitor IP": {
                        "type": "string"
                    },
                    "domain": {
                        "type": "string"
                    },
                    "uniqID": {
                        "type": "string"
                    }
                }
            }
        }
    }
}
------------- Update and New Mapping -----------
So I deleted the index and recreated it. The original index had some data in it from before I knew anything about mapping the data to specific field types. This seemed to fix the issue with only a few fields being indexed.
However, parts of my mapping appear to be ignored. Specifically the Agent string mapping:
visit_mapping = {
    'properties': {
        'uniqID': str_not_analyzed,
        'pages': str_not_analyzed,
        'domain': str_not_analyzed,
        'Srvr IP': str_not_analyzed,
        'Visitor IP': str_not_analyzed,
        'Agent': { 'type': 'string', 'index': 'not_analyzed' },
        'Referrer': { 'type': 'string' },
        'Entrance Time': time_date_mapping,
        'Request Time': time_date_mapping,
        'Raw': { 'type': 'string', 'index': 'not_analyzed' },
    },
}
Here's the output of http://localhost:9200/visits_test2/_mapping
{
    "visits_test2": {
        "mappings": {
            "visit": {
                "properties": {
                    "Agent": {"type": "string"},
                    "Entrance Time": {"type": "date", "format": "dateOptionalTime"},
                    "Raw": {
                        "properties": {
                            "Entrance Time": {"type": "double"},
                            "domain": {"type": "string"},
                            "uniqID": {"type": "string"}
                        }
                    },
                    "Referrer": {"type": "string"},
                    "Request Time": {"type": "date", "format": "dateOptionalTime"},
                    "Srvr IP": {"type": "string"},
                    "Visitor IP": {"type": "string"},
                    "domain": {"type": "string"},
                    "uniqID": {"type": "string"}
                }
            }
        }
    }
}
Note that I've used an entirely new index. The reason being that I wanted to make sure nothing was carrying over from one to the next.
Note that I'm using the Python library elasticsearch.py and following their examples for mapping syntax.
--------- Python Code for Entering Data into ES, per comment request -----------
Below is a file named mapping.py. I have not yet fully commented the code, since this was just code to test whether this method of data entry into ES was viable. If it is not self-explanatory, let me know and I'll add additional comments.
Note, I programmed in PHP for years before picking up Python. In order to get up and running faster with Python I created a couple of files with basic string and file manipulation functions and made them into a package. They are written in Python and meant to mimic the behavior of a built-in PHP function. So when you see a call to php_basic_* it is one of those functions.
# Standard Library Imports
import json, copy, datetime, time, enum, os, sys, numpy, math
from datetime import datetime
from enum import Enum, unique

from elasticsearch import Elasticsearch

# My Library
import basicconfig, mybasics
from mybasics.cBaseClass import BaseClass, BaseClassErrors
from mybasics.cHelpers import HandleErrors, LogLvl

# This imports several constants, a couple of functions, and a helper class
from basicconfig.startup_config import *

# Connect to ElasticSearch
es = Elasticsearch([{'host': 'localhost', 'port': '9200'}])

# Create mappings of a visit
time_date_mapping = { 'type': 'date_time' }
str_not_analyzed = { 'type': 'string' }  # This originally included 'index': 'not_analyzed' as well

visit_mapping = {
    'properties': {
        'uniqID': str_not_analyzed,
        'pages': str_not_analyzed,
        'domain': str_not_analyzed,
        'Srvr IP': str_not_analyzed,
        'Visitor IP': str_not_analyzed,
        'Agent': { 'type': 'string', 'index': 'not_analyzed' },
        'Referrer': { 'type': 'string' },
        'Entrance Time': time_date_mapping,
        'Request Time': time_date_mapping,
        'Raw': { 'type': 'string', 'index': 'not_analyzed' },
        'Pages': { 'type': 'string', 'index': 'not_analyzed' },
    },
}


class Visit_to_ElasticSearch(object):
    """
    """

    INDEX = 'visits'
    DOC_TYPE = 'visit'

    def __init__(self, fname, index=True):
        """
        """
        self._visit = json.loads(php_basic_files.file_get_contents(fname))
        self._pages = self._visit.pop('pages')

        self.uniqID = self._visit['uniqID']
        self.domain = self._visit['domain']
        self.entrance_time = self._convert_time(self._visit['Entrance Time'])

        # Get a list of the page IDs
        self.pages = self._pages.keys()

        # Extract IPs and such from a single page
        page = self._pages[self.pages[0]]
        srvr = page['SERVER']
        req = page['REQUEST']

        self.visitor_ip = srvr['REMOTE_ADDR']
        self.srvr_ip = srvr['SERVER_ADDR']
        self.request_time = self._convert_time(srvr['REQUEST_TIME'])
        self.agent = srvr['HTTP_USER_AGENT']

        # Now go grab data that might not be there...
        self._extract_optional()

        if index is True:
            self.index_with_elasticsearch()

    def _convert_time(self, ts):
        """
        """
        try:
            dt = datetime.fromtimestamp(ts)
        except TypeError:
            dt = datetime.fromtimestamp(float(ts))

        return dt.strftime('%Y-%m-%dT%H:%M:%S')

    def _extract_optional(self):
        """
        """
        self.referrer = ''

    def index_with_elasticsearch(self):
        """
        """
        visit = {
            'uniqID': self.uniqID,
            'pages': [],
            'domain': self.domain,
            'Srvr IP': self.srvr_ip,
            'Visitor IP': self.visitor_ip,
            'Agent': self.agent,
            'Referrer': self.referrer,
            'Entrance Time': self.entrance_time,
            'Request Time': self.request_time,
            'Raw': self._visit,
            'Pages': php_basic_str.implode(', ', self.pages),
        }

        es.index(
            index=Visit_to_ElasticSearch.INDEX,
            doc_type=Visit_to_ElasticSearch.DOC_TYPE,
            id=self.uniqID,
            timestamp=int(math.floor(self._visit['Entrance Time'])),
            body=visit
        )


es.indices.create(
    index=Visit_to_ElasticSearch.INDEX,
    body={
        'settings': {
            'number_of_shards': 5,
            'number_of_replicas': 1,
        }
    },
    # ignore already existing index
    ignore=400
)
In case it matters this is the simple loop I use to dump the data into ES:
for f in all_files:
    try:
        visit = mapping.Visit_to_ElasticSearch(f)
    except IOError:
        pass
where all_files is a list of all the visit files (full path) I have in my test data set.
Here is a sample visit file from a Google Bot visit:
{u'Entrance Time': 1407551587.7385,
 u'domain': u'############',
 u'pages': {u'6818555600ccd9880bf7acef228c5d47': {
     u'REQUEST': [],
     u'SERVER': {u'DOCUMENT_ROOT': u'/var/www/####/',
                 u'Entrance Time': 1407551587.7385,
                 u'GATEWAY_INTERFACE': u'CGI/1.1',
                 u'HTTP_ACCEPT': u'*/*',
                 u'HTTP_ACCEPT_ENCODING': u'gzip,deflate',
                 u'HTTP_CONNECTION': u'Keep-alive',
                 u'HTTP_FROM': u'googlebot(at)googlebot.com',
                 u'HTTP_HOST': u'############',
                 u'HTTP_IF_MODIFIED_SINCE': u'Fri, 13 Jun 2014 20:26:33 GMT',
                 u'HTTP_USER_AGENT': u'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
                 u'PATH': u'/usr/local/bin:/usr/bin:/bin',
                 u'PHP_SELF': u'/index.php',
                 u'QUERY_STRING': u'',
                 u'REDIRECT_SCRIPT_URI': u'http://############/',
                 u'REDIRECT_SCRIPT_URL': u'############',
                 u'REDIRECT_STATUS': u'200',
                 u'REDIRECT_URL': u'############',
                 u'REMOTE_ADDR': u'############',
                 u'REMOTE_PORT': u'46271',
                 u'REQUEST_METHOD': u'GET',
                 u'REQUEST_TIME': u'1407551587',
                 u'REQUEST_URI': u'############',
                 u'SCRIPT_FILENAME': u'/var/www/PIAN/index.php',
                 u'SCRIPT_NAME': u'/index.php',
                 u'SCRIPT_URI': u'http://############/',
                 u'SCRIPT_URL': u'/############/',
                 u'SERVER_ADDR': u'############',
                 u'SERVER_ADMIN': u'admin#############',
                 u'SERVER_NAME': u'############',
                 u'SERVER_PORT': u'80',
                 u'SERVER_PROTOCOL': u'HTTP/1.1',
                 u'SERVER_SIGNATURE': u'<address>Apache/2.2.22 (Ubuntu) Server at ############ Port 80</address>\n',
                 u'SERVER_SOFTWARE': u'Apache/2.2.22 (Ubuntu)',
                 u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'},
     u'SESSION': {u'Entrance Time': 1407551587.7385,
                  u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'}}},
 u'uniqID': u'bbc398716f4703cfabd761cc8d4101a1'}
Now I understand better why the Raw field is an object instead of a simple string: it is assigned self._visit, which in turn was initialized with json.loads(php_basic_files.file_get_contents(fname)).
Anyway, based on all the information you've given above, my take is that the mapping was never installed via put_mapping. From there on, there's no way anything else can work the way you'd like. I suggest you modify your code to install the mapping before you index your first visit document.
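Here's a minimal sketch of what that could look like, reusing the es, visit_mapping, time_date_mapping, and Visit_to_ElasticSearch names from your code above. Note also that, as far as I know, ES 1.x has no 'date_time' field type (only 'date'), so the mapping needs that fix or the put_mapping call itself will be rejected:
# Sketch: run once at startup, before the first visit document is indexed.
time_date_mapping['type'] = 'date'  # 'date_time' is not a valid ES 1.x field type

es.indices.create(
    index=Visit_to_ElasticSearch.INDEX,
    body={'settings': {'number_of_shards': 5, 'number_of_replicas': 1}},
    ignore=400  # tolerate an already-existing index
)
es.indices.put_mapping(
    index=Visit_to_ElasticSearch.INDEX,
    doc_type=Visit_to_ElasticSearch.DOC_TYPE,
    body={Visit_to_ElasticSearch.DOC_TYPE: visit_mapping}
)
# Only after this point should es.index(...) run for visit documents.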
Related
I'm trying to get costs using the CostExplorer client in boto3, but I can't find the values to use as a Dimension filter. The documentation says we can extract those values from GetDimensionValues, but how do I use GetDimensionValues?
response = client.get_cost_and_usage(
    TimePeriod={
        'Start': str(start_time).split()[0],
        'End': str(end_time).split()[0]
    },
    Granularity='DAILY',
    Filter={
        'Dimensions': {
            'Key': 'USAGE_TYPE',
            'Values': [
                'DataTransfer-In-Bytes'
            ]
        }
    },
    Metrics=[
        'NetUnblendedCost',
    ],
    GroupBy=[
        {
            'Type': 'DIMENSION',
            'Key': 'SERVICE'
        },
    ]
)
The boto3 reference for GetDimensionValues has a lot of details on how to use that call. Here's some sample code you might use to print out possible dimension values:
response = client.get_dimension_values(
    TimePeriod={
        'Start': '2022-01-01',
        'End': '2022-06-01'
    },
    Dimension='USAGE_TYPE',
    Context='COST_AND_USAGE',
)

for dimension_value in response["DimensionValues"]:
    print(dimension_value["Value"])
Output:
APN1-Catalog-Request
APN1-DataTransfer-Out-Bytes
APN1-Requests-Tier1
APN2-Catalog-Request
APN2-DataTransfer-Out-Bytes
APN2-Requests-Tier1
APS1-Catalog-Request
APS1-DataTransfer-Out-Bytes
.....
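Once you spot the exact value you need in that list, you can feed it back into the Filter of your original get_cost_and_usage call. A sketch based on the code in the question (the APN1- value is just one of those printed above):
response = client.get_cost_and_usage(
    TimePeriod={'Start': '2022-01-01', 'End': '2022-06-01'},
    Granularity='DAILY',
    Filter={
        'Dimensions': {
            'Key': 'USAGE_TYPE',
            # use an exact value taken from the get_dimension_values output
            'Values': ['APN1-DataTransfer-Out-Bytes']
        }
    },
    Metrics=['NetUnblendedCost'],
)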
I have written this function, which returns fine when it is just returning a string. I have followed the syntax for the response card very closely and it passes my test case in Lambda. However, when it's called through Lex it throws an error, which I'll post below. It says fulfillmentState cannot be null, but the error it throws shows that it is not null.
I have tried switching the order of the dialog action and response card, and I have tried switching the order of "type" and "fulfillmentState". Function:
def backup_phone(intent_request):
    back_up_location = get_slots(intent_request)["BackupLocation"]
    phone_os = get_slots(intent_request)["PhoneType"]

    try:
        from googlesearch import search
    except ImportError:
        print("No module named 'google' found")

    # to search
    query = "How to back up {} to {}".format(phone_os, back_up_location)

    result_list = []
    for j in search(query, tld="com", num=5, stop=5, pause=2):
        result_list.append(j)

    return {
        "dialogAction": {
            "fulfilmentState": "Fulfilled",
            "type": "Close",
            "contentType": "Plain Text",
            'content': "Here you go",
        },
        'responseCard': {
            'contentType': 'application/vnd.amazonaws.card.generic',
            'version': 1,
            'genericAttachments': [{
                'title': "Please select one of the options",
                'subTitle': "{}".format(query),
                'buttons': [
                    {
                        "text": "{}".format(result_list[0]),
                        "value": "test"
                    },
                ]
            }]
        }
    }
screenshot of test case passing in lambda: https://ibb.co/sgjC2WK
screenshot of error throw in Lex: https://ibb.co/yqwN42m
Text for the error in Lex:
"An error has occurred: Invalid Lambda Response: Received invalid response from Lambda: Can not construct instance of CloseDialogAction, problem: fulfillmentState must not be null for Close dialog action at [Source: {"dialogAction": {"fulfilmentState": "Fulfilled", "type": "Close", "contentType": "Plain Text", "content": "Here you go"}, "responseCard": {"contentType": "application/vnd.amazonaws.card.generic", "version": 1, "genericAttachments": [{"title": "Please select one of the options", "subTitle": "How to back up Iphone to windows", "buttons": [{"text": "https://www.easeus.com/iphone-data-transfer/how-to-backup-your-iphone-with-windows-10.html", "value": "test"}]}]}}; line: 1, column: 121]"
I fixed the issue by sending everything I was trying to return through a function, storing all the info in the correct syntax in an object, and then returning that object from the function. (Worth noting: the failing payload in the error above spells the key "fulfilmentState" with a single "l", while Lex expects "fulfillmentState", which is why it reports the field as null even though it looks populated.) Relevant code below:
def close(session_attributes, fulfillment_state, message, response_card):
    response = {
        'sessionAttributes': session_attributes,
        'dialogAction': {
            'type': 'Close',
            'fulfillmentState': fulfillment_state,
            'message': message,
            "responseCard": response_card,
        }
    }

    return response


def backup_phone(intent_request):
    back_up_location = get_slots(intent_request)["BackupLocation"]
    phone_os = get_slots(intent_request)["PhoneType"]

    try:
        from googlesearch import search
    except ImportError:
        print("No module named 'google' found")

    # to search
    query = "How to back up {} to {}".format(phone_os, back_up_location)

    result_list = []
    for j in search(query, tld="com", num=1, stop=1, pause=1):
        result_list.append(j)

    return close(intent_request['sessionAttributes'],
                 'Fulfilled',
                 {'contentType': 'PlainText',
                  'content': 'test'},
                 {'version': 1,
                  'contentType': 'application/vnd.amazonaws.card.generic',
                  'genericAttachments': [
                      {
                          'title': "{}".format(query.lower()),
                          'subTitle': "Please select one of the options",
                          "imageUrl": "",
                          "attachmentLinkUrl": "{}".format(result_list[0]),
                          'buttons': [
                              {
                                  "text": "{}".format(result_list[0]),
                                  "value": "Thanks"
                              },
                          ]
                      }
                  ]}
                 )
I'm testing MongoDB as the DB behind a Flask REST server (with flask-pymongo), using the mockupdb module. I want to receive a DateTime in the JSON request and store it as a Date object, so that I can perform range queries on this field in the future. So I send the data as EJSON (BSON) to keep the data exactly as is.
This is the testcase:
@pytest.fixture()
def client_and_mongoserver():
    random.seed()

    mongo_server = MockupDB(auto_ismaster=True, verbose=True)
    mongo_server.run()

    config = Config()
    config.MONGO_URI = mongo_server.uri + '/test'

    flask_app = create_app(config)
    flask_app.testing = True
    client = flask_app.test_client()

    yield client, mongo_server

    mongo_server.stop()
def test_insert(client_and_mongoserver):
    client, server = client_and_mongoserver

    headers = [('Content-Type', 'application/json')]

    id = str(uuid.uuid4()).encode('utf-8')[:12]
    now = datetime.now()
    obj_id = ObjectId(id)
    toInsert = {
        "_id": obj_id,
        "datetime": now
    }
    toVerify = {
        "_id": obj_id,
        "datetime": now
    }

    future = go(client.post, '/api/insert', data=dumps(toInsert), headers=headers)

    request = server.receives(
        OpMsg({
            'insert': 'test',
            'ordered': True,
            '$db': "test",
            '$readPreference': {"mode": "primary"},
            'documents': [
                toVerify
            ]
        }, namespace='test')
    )
    request.ok(cursor={'inserted_id': id})

    # act
    http_response = future()

    # assert
    data = http_response.get_data(as_text=True)
This is the endpoint. Before the insert call I convert the datetime string to a datetime object:
from flask import request, abort
from flask_restful import Resource
from bson import json_util

# `mongo` is the flask_pymongo.PyMongo instance created with the app


class Insert(Resource):
    def post(self):
        if not request.json:
            abort(400)

        json_data = json_util.loads(request.data)
        result = mongo.db.test.insert_one(json_data)

        return {'message': 'OK'}, 200
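For what it's worth, this is the round-trip that makes the approach work: bson.json_util serializes datetimes to the extended-JSON {"$date": ...} form and parses them back into datetime objects, which PyMongo then stores as BSON Dates. A quick standalone sketch:
from datetime import datetime

from bson import json_util

payload = json_util.dumps({'datetime': datetime(2018, 11, 29, 12, 0)})
print(payload)                # {"datetime": {"$date": 1543492800000}}

doc = json_util.loads(payload)
print(type(doc['datetime']))  # <class 'datetime.datetime'>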
But the test generates this assertion error:
self = MockupDB(localhost, 37213)
args = (OpMsg({"insert": "test", "ordered": true, "$db": "test", "$readPreference": {"mode": "primary"}, "documents": [{"_id": {"$oid": "63343264363661622d393764"}, "datetime": {"$date": 1543493218306}}]}, namespace="test"),)
kwargs = {}, timeout = 10, end = 1543504028.309115
matcher = Matcher(OpMsg({"insert": "test", "ordered": true, "$db": "test", "$readPreference": {"mode": "primary"}, "documents": [{"_id": {"$oid": "63343264363661622d393764"}, "datetime": {"$date": 1543493218306}}]}, namespace="test"))
request = OpMsg({"insert": "test", "ordered": true, "$db": "test", "$readPreference": {"mode": "primary"}, "documents": [{"_id": {"$oid": "63343264363661622d393764"}, "datetime": {"$date": 1543493218306}}]}, namespace="test")

    def receives(self, *args, **kwargs):
        """Pop the next `Request` and assert it matches.

        Returns None if the server is stopped.

        Pass a `Request` or request pattern to specify what client request to
        expect. See the tutorial for examples. Pass ``timeout`` as a keyword
        argument to override this server's ``request_timeout``.
        """
        timeout = kwargs.pop('timeout', self._request_timeout)
        end = time.time() + timeout
        matcher = Matcher(*args, **kwargs)
        while not self._stopped:
            try:
                # Short timeout so we notice if the server is stopped.
                request = self._request_q.get(timeout=0.05)
            except Empty:
                if time.time() > end:
                    raise AssertionError('expected to receive %r, got nothing' % matcher.prototype)
            else:
                if matcher.matches(request):
                    return request
                else:
                    raise AssertionError('expected to receive %r, got %r'
>                                        % (matcher.prototype, request))
E       AssertionError: expected to receive OpMsg({"insert": "test", "ordered": true, "$db": "test", "$readPreference": {"mode": "primary"}, "documents": [{"_id": {"$oid": "63343264363661622d393764"}, "datetime": {"$date": 1543493218306}}]}, namespace="test"), got OpMsg({"insert": "test", "ordered": true, "$db": "test", "$readPreference": {"mode": "primary"}, "documents": [{"_id": {"$oid": "63343264363661622d393764"}, "datetime": {"$date": 1543493218306}}]}, namespace="test")

.venv/lib/python3.6/site-packages/mockupdb/__init__.py:1291: AssertionError
The values match but the assertion is raised either way.
How can I test the Date object using Flask?
EDIT:
As pointed out by @bauman.space, the lack of:
'$db': 'test', # this key appears somewhere at the driver
'$readPreference': {"mode": "primary"}, # so does this one
doesn't affect the validation made by mockupdb. I'd tested that in other test cases.
EDIT 2: Changed the question to prevent confusion
Your assertion error is quite descriptive:
AssertionError:
expected to receive

    OpMsg(
        {"insert": "test",
         "ordered": true,
         "documents": [{"_id": "a3dbe8a7e1cc43469b706a8877b0a14a",
                        "datetime": {"$date": 1542901445120}}]
        }, namespace="test"
    ),

got

    OpMsg(
        {"insert": "test",
         "ordered": true,
         "$db": "test",
         "$readPreference": {"mode": "primary"},
         "documents": [{"_id": "a3dbe8a7e1cc43469b706a8877b0a14a",
                        "datetime": {"$date": 1542901445120}}]
        }, namespace="test")
Looks like you simply need to include some of the standard MongoDB keys in your verification code.
Swap yours out with this and give it a try?
request = server.receives(
    OpMsg({
        'insert': 'test',
        'ordered': True,
        '$db': 'test',  # this key appears somewhere at the driver
        '$readPreference': {"mode": "primary"},  # so does this one
        'documents': [
            toVerify
        ]
    }, namespace='test')
)
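One more thing worth checking (an educated guess on my part, not something the traceback proves): BSON stores dates with millisecond precision, while datetime.now() carries microseconds. Both reprs above print the already-truncated {"$date": ...} form, so the expected and actual documents can look identical even while the underlying datetime objects still differ in their sub-millisecond digits. Truncating before building toInsert/toVerify rules that out:
from datetime import datetime

now = datetime.now()
# drop the sub-millisecond part so the in-memory value equals what the
# driver actually sends over the wire as a BSON date
now = now.replace(microsecond=(now.microsecond // 1000) * 1000)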
I am using Django Haystack with Elasticsearch. I have a string field called 'code' in this type of format:
76-010
I would like to be able to search
76-
And get as a result
76-111
76-110
76-210
...
and so on.
but I don't want to get these results:
11-760
11-076
...
I already have a custom Elasticsearch backend, but I am not sure how I should index the field to get the desired behavior.
class ConfigurableElasticBackend(ElasticsearchSearchBackend):

    def __init__(self, connection_alias, **connection_options):
        # see http://stackoverflow.com/questions/13636419/elasticsearch-edgengrams-and-numbers
        self.DEFAULT_SETTINGS['settings']['analysis']['analyzer']['edgengram_analyzer']['tokenizer'] = 'standard'
        self.DEFAULT_SETTINGS['settings']['analysis']['analyzer']['edgengram_analyzer']['filter'].append('lowercase')

        super(ConfigurableElasticBackend, self).__init__(connection_alias, **connection_options)
The idea is to use an edgeNGram tokenizer in order to index every prefix of your code field. For instance, we would like 76-111 to be indexed as 7, 76, 76-, 76-1, 76-11 and 76-111. That way you will find 76-111 by searching for any of its prefixes.
Note that this article provides a full-fledged solution to your problem. The index settings for your case would look like this in Django code. You can then follow that article to wrap it up, but this should get you started.
import pyelasticsearch  # needed for the explicit connection below


class ConfigurableElasticBackend(ElasticsearchSearchBackend):

    DEFAULT_SETTINGS = {
        "settings": {
            "analysis": {
                "analyzer": {
                    "edgengram_analyzer": {
                        "tokenizer": "edgengram_tokenizer",
                        "filter": ["lowercase"]
                    }
                },
                "tokenizer": {
                    "edgengram_tokenizer": {
                        "type": "edgeNGram",
                        "min_gram": "1",
                        "max_gram": "25"
                    }
                }
            }
        },
        "mappings": {
            "your_type": {
                "properties": {
                    "code": {
                        "type": "string",
                        "analyzer": "edgengram_analyzer"
                    }
                }
            }
        }
    }

    def __init__(self, connection_alias, **connection_options):
        super(ConfigurableElasticBackend, self).__init__(connection_alias, **connection_options)
        self.conn = pyelasticsearch.ElasticSearch(connection_options['URL'], timeout=self.timeout)
        self.index_name = connection_options['INDEX_NAME']
        # create the index with the above settings
        self.conn.create_index(self.index_name, self.DEFAULT_SETTINGS)
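If you want to double-check what the analyzer emits before wiring everything into Haystack, you can query the _analyze endpoint directly. A sketch ('your_index' stands in for whatever INDEX_NAME you configured):
import json

import requests

resp = requests.get(
    'http://localhost:9200/your_index/_analyze',
    params={'analyzer': 'edgengram_analyzer', 'text': '76-111'},
)
# expect the tokens: 7, 76, 76-, 76-1, 76-11, 76-111
print(json.dumps(resp.json(), indent=2))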
I'm using Django 1.5 with django-haystack 2.0 and an elasticsearch backend. I'm trying to search by an exact attribute match. However, I'm getting "similar" results even though I'm using both the __exact operator and the Exact() class. How can I prevent this behavior?
For example:
# models.py
class Person(models.Model):
    name = models.TextField()


# search_indexes.py
class PersonIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name = indexes.CharField(model_attr="name")

    def get_model(self):
        return Person

    def index_queryset(self, using=None):
        return self.get_model().objects.all()
# templates/search/indexes/people/person_text.txt
{{ object.name }}
>>> p1 = Person(name="Simon")
>>> p1.save()
>>> p2 = Person(name="Simons")
>>> p2.save()

$ ./manage.py rebuild_index

>>> person_sqs = SearchQuerySet().models(Person)
>>> person_sqs.filter(name__exact="Simons")
[<SearchResult: people.person (name=u'Simon')>,
 <SearchResult: people.person (name=u'Simons')>]
>>> person_sqs.filter(name=Exact("Simons", clean=True))
[<SearchResult: people.person (name=u'Simon')>,
 <SearchResult: people.person (name=u'Simons')>]
I only want the search result for "Simons" - the "Simon" result should not show up.
Python3, Django 1.10, Elasticsearch 2.4.4.
TL;DR: define custom tokenizer (not filter)
Verbose explanation
a) use EdgeNgramField:
# search_indexes.py
class PersonIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.EdgeNgramField(document=True, use_template=True)
    ...
b) template:
# templates/search/indexes/people/person_text.txt
{{ object.name }}
c) create custom search backend:
# backends.py
from django.conf import settings
from haystack.backends.elasticsearch_backend import (
    ElasticsearchSearchBackend,
    ElasticsearchSearchEngine,
)


class CustomElasticsearchSearchBackend(ElasticsearchSearchBackend):

    def __init__(self, connection_alias, **connection_options):
        super(CustomElasticsearchSearchBackend, self).__init__(
            connection_alias, **connection_options)
        setattr(self, 'DEFAULT_SETTINGS', settings.ELASTICSEARCH_INDEX_SETTINGS)


class CustomElasticsearchSearchEngine(ElasticsearchSearchEngine):
    backend = CustomElasticsearchSearchBackend
d) define custom tokenizer (not filter!):
# settings.py
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'apps.persons.backends.CustomElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}

ELASTICSEARCH_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "custom_ngram_tokenizer",
                    "filter": ["asciifolding", "lowercase"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "custom_edgengram_tokenizer",
                    "filter": ["asciifolding", "lowercase"]
                }
            },
            "tokenizer": {
                "custom_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 12,
                    "token_chars": ["letter", "digit"]
                },
                "custom_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 12,
                    "token_chars": ["letter", "digit"]
                }
            }
        }
    }
}

HAYSTACK_DEFAULT_OPERATOR = 'AND'
e) use AutoQuery (more versatile):
# views.py
search_value = 'Simons'
...
person_sqs = \
    SearchQuerySet().models(Person).filter(
        content=AutoQuery(search_value)
    )
f) reindex after changes:
$ ./manage.py rebuild_index
I was facing a similar problem. If you change the settings of your Haystack Elasticsearch backend like this:
DEFAULT_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["haystack_ngram", "lowercase"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["haystack_edgengram", "lowercase"]
                }
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 6,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 6,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 6,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 6,
                    "max_gram": 15
                }
            }
        }
    }
}
Then it will tokenize only when the query is at least 6 characters long.
If you want results like "xyzsimonsxyz", you would need to use the ngram analyzer instead of EdgeNGram, or you could use both, depending on your requirements. EdgeNGram generates tokens only from the beginning of each term.
With NGram, 'simons' will be one of the generated tokens for the term 'xyzsimonsxyz' (assuming max_gram >= 6) and you will get the expected results. Also, the search_analyzer needs to be different or you will get weird results; a sketch of that follows below.
Keep in mind the index size might get pretty big with ngram if you have huge chunks of text.
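To make the search_analyzer point concrete, here is what a field mapping with split analyzers could look like (a sketch in ES 2.x syntax; the field and analyzer names are illustrative, not from the code above). The ngram analyzer runs at index time only, while the query string is tokenized by the standard analyzer so it is not itself chopped into grams:
person_mapping = {
    "properties": {
        "name": {
            "type": "string",
            "analyzer": "ngram_analyzer",   # index time: emit 6..15-char grams
            "search_analyzer": "standard",  # query time: keep search terms whole
        }
    }
}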
Don't use CharField, use EdgeNgramField:
# search_indexes.py
class PersonIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name = indexes.EdgeNgramField(model_attr="name")

    def get_model(self):
        return Person

    def index_queryset(self, using=None):
        return self.get_model().objects.all()
And don't use filter, use autocomplete:
person_sqs = SearchQuerySet().models(Person)
person_sqs.autocomplete(name="Simons")
source: http://django-haystack.readthedocs.org/en/v2.0.0/autocomplete.html