I'm using django-haystack and ElasticSearch to index Stores.
Until now, each store had a single (lat, long) coordinate pair. We had to change this because one store can deliver products to several disjoint regions, so I've added up to ten locations (lat, long pairs) per store.
With a single location field everything worked fine and I got the right results. Now, with multiple location fields, I can't get any results, not even the previous ones, for the same user and store coordinates.
My index is as follows:
class StoreIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True,
                             template_name='search/indexes/store/store_text.txt')
    location0 = indexes.LocationField()
    location1 = indexes.LocationField()
    location2 = indexes.LocationField()
    location3 = indexes.LocationField()
    location4 = indexes.LocationField()
    location5 = indexes.LocationField()
    location6 = indexes.LocationField()
    location7 = indexes.LocationField()
    location8 = indexes.LocationField()
    location9 = indexes.LocationField()

    def get_model(self):
        return Store

    def prepare_location0(self, obj):
        # If you're just storing the floats...
        return "%s,%s" % (obj.latitude, obj.longitude)

    # ..... up to prepare_location9
    def prepare_location9(self, obj):
        # If you're just storing the floats...
        return "%s,%s" % (obj.latitude_9, obj.longitude_9)
Is this the correct way to build my index?
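One thing that may be worth ruling out, as an assumption rather than a confirmed cause: the question does not say whether every store fills all ten coordinate pairs. If some pairs are empty, the format string used in the prepare methods produces the text "None,None", which is not a valid geo_point value. A minimal sketch of a guard follows; format_location is a hypothetical helper, not part of Haystack.

```python
def format_location(latitude, longitude):
    """Return "lat,lon" for a complete pair, or None if either part is missing."""
    if latitude is None or longitude is None:
        return None
    return "%s,%s" % (latitude, longitude)


# A complete pair is formatted the same way as in the index class.
print(format_location(-23.4487554, -46.58912))
# An empty pair yields None instead of an unparsable string.
print(format_location(None, None))
# What the unguarded format string produces for an empty pair.
print("%s,%s" % (None, None))
```

Each prepare_locationN method could then return format_location(obj.latitude_N, obj.longitude_N). Whether a None value is accepted for a LocationField depends on how the field is declared, so treat this as something to verify rather than a known fix.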
From elasticsearch I get this mapping information:
curl -XGET http://localhost:9200/stores/_mapping?pretty=True
{
"stores" : {
"modelresult" : {
"properties" : {
"django_id" : {
"type" : "string"
},
"location0" : {
"type" : "geo_point",
"store" : "yes"
},
"location1" : {
"type" : "geo_point",
"store" : "yes"
},
"location2" : {
"type" : "geo_point",
"store" : "yes"
},
"location3" : {
"type" : "geo_point",
"store" : "yes"
},
"location4" : {
"type" : "geo_point",
"store" : "yes"
},
"location5" : {
"type" : "geo_point",
"store" : "yes"
},
"location6" : {
"type" : "geo_point",
"store" : "yes"
},
"location7" : {
"type" : "geo_point",
"store" : "yes"
},
"location8" : {
"type" : "geo_point",
"store" : "yes"
},
"location9" : {
"type" : "geo_point",
"store" : "yes"
},
"text" : {
"type" : "string",
"analyzer" : "snowball",
"store" : "yes",
"term_vector" : "with_positions_offsets"
}
}
}
}
}
Then, I try to query this way:
sqs0 = SearchQuerySet().dwithin('location0', usuario, max_dist).distance('location0',usuario).using('stores')
where:
usuario is a Point instance representing the user trying to find stores near his position and
max_dist is a D instance.
If I query directly using curl, I get no results either.
Here is the result of querying using curl with multiple location fields:
$ curl -XGET http://localhost:9200/stores/modelresult/_search?pretty=true -d '{ "query" : { "match_all": {} }, "filter" : {"geo_distance" : { "distance" : "6km", "location0" : { "lat" : -23.5, "lon" : -46.6 } } } } '
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
If I comment out the fields location1-9 from the StoreIndex class, everything works fine, but if I leave them in to get multiple location points, I get no results for the same query (user position). This happens for the same query both in Django and directly using curl. That is, with only one location (say location0), both queries return correct results. With more locations (location0-9), neither query returns any results.
Here are the results of querying directly using curl with only one location field:
$ curl -XGET http://localhost:9200/stores/modelresult/_search?pretty=true -d '{ "query" : { "match_all": {} }, "filter" : {"geo_distance" : { "distance" : "6km", "location0" : { "lat" : -23.5, "lon" : -46.6 } } } } '
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 9,
"max_score" : 1.0,
"hits" : [ {
"_index" : "stores",
"_type" : "modelresult",
"_id" : "store.store.110",
"_score" : 1.0, "_source" : {"django_ct": "store.store", "text": "RESULT OF THE SEARCH \n\n", "django_id": "110", "id": "store.store.110", "location0": "-23.4487554,-46.58912"}
},
(lots of results here)
]
}
}
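As a sanity check on the single-field result above, the distance between the query point and store 110 can be computed independently of Elasticsearch. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; both are assumptions about how to approximate the distance, not a statement of how Elasticsearch computes it.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


user = (-23.5, -46.6)                  # point used in the curl query
store_110 = (-23.4487554, -46.58912)   # location0 of the first hit
distance = haversine_km(*user, *store_110)
print(round(distance, 2))
```

The value comes out a little under 6 km, which is consistent with store 110 being returned by the 6km geo_distance filter.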
Of course, I run rebuild_index after any change in StoreIndex.
Any help on how to get multiple location fields working with elasticsearch and django?
PS.: I've cross posted this question on Django-Haystack and ElasticSearch Google Groups.
https://groups.google.com/d/topic/elasticsearch/85fg7vdCBBU/discussion
https://groups.google.com/d/topic/django-haystack/m2A3_SF8-ls/discussion
Thanks in advance
Mário
Related
I'm working on a project containing django, elasticsearch and django-elasticsearch-dsl. I'm collecting a quite large amount of data and saving it to postgres and indexing it to elasticsearch, via django-elasticsearch-dsl.
I'm running into a problem I don't understand, and I have no further hints about what is happening:
Relevant part of Django's models.py file:
class LinkDenorm(BaseModel):
    ...
    link = CharField(null=True, max_length=2710, db_index=True)
    link_expanded = TextField(null=True, db_index=True)
    title = TextField(null=True, db_index=True)
    text = TextField(null=True)
    ...
Relevant part of django-elasticsearch-dsl documents.py file:
@registry.register_document
class LinkDenorm(Document):
    link_expanded = fields.KeywordField(attr='link_expanded')

    class Index:
        name = 'denorms_v10'

    class Django:
        model = models.LinkDenorm
        fields = [
            ...
            'link',
            'title',
            'text',
            ...
        ]
After the data is successfully indexed, I verify that the index contains the correct fields:
curl -X GET -u <myuser>:<mypasswd> "http://<my-hostname>/denorms_v10/?pretty"
{
"denorms_v10" : {
"mappings" : {
"properties" : {
...
"link" : {
"type" : "text"
},
"title" : {
"type" : "text"
},
"text" : {
"type" : "text"
},
"link_expanded" : {
"type" : "keyword"
},
...
}
}
}
}
After a certain amount of time (sometimes weeks, sometimes days) the index fields change. Executing the same curl lookup as before gives me:
curl -X GET -u <myuser>:<mypasswd> "http://<my-hostname>/denorms_v10/?pretty"
{
"denorms_v10" : {
"mappings" : {
"properties" : {
...
"link" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"text" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"link_expanded" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
...
}
}
}
}
After the change happens, the queries fail because the data type is no longer correct. I investigated the Elasticsearch and Django logs, but nothing there gives a clue about what happens to the index.
I'm a bit lost and running out of ideas. Any suggestions are most welcome. Thank you!
Miha, your index is probably managed by some kind of ILM policy without any index template.
Either you are querying an alias and the indices behind that alias are changing,
or a process on your side regularly deletes the index (depending on its size or the number of documents in it).
Then, when your app posts a document again, Elasticsearch recreates the index with the default dynamic mapping.
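The second mapping in the question has the shape Elasticsearch gives to strings under dynamic mapping: a text field with a keyword sub-field and ignore_above set to 256. A minimal sketch for spotting that drift from the "properties" section of a mapping response is shown below; dynamically_mapped_fields is a hypothetical helper, and the sample dictionaries are trimmed copies of the two mappings in the question.

```python
def dynamically_mapped_fields(properties):
    """Return the names of fields that look like default dynamic string mappings."""
    suspicious = []
    for name, spec in properties.items():
        keyword = spec.get("fields", {}).get("keyword", {})
        if (spec.get("type") == "text"
                and keyword.get("type") == "keyword"
                and keyword.get("ignore_above") == 256):
            suspicious.append(name)
    return sorted(suspicious)


# Mapping as created by the document class.
expected = {
    "link": {"type": "text"},
    "link_expanded": {"type": "keyword"},
}
# Mapping observed after the index changed.
default_subfield = {"keyword": {"type": "keyword", "ignore_above": 256}}
recreated = {
    "link": {"type": "text", "fields": default_subfield},
    "link_expanded": {"type": "text", "fields": default_subfield},
}

print(dynamically_mapped_fields(expected))
print(dynamically_mapped_fields(recreated))
```

Running such a check periodically against the mapping returned for denorms_v10 would at least show when the index was recreated, which narrows down which process deleted it.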
I'm trying to filter a query with term and range along with a query string. filter(range) and the query string work, but filter(term) does not. Am I doing something wrong?
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Index

es = Elasticsearch([{'host': '192.168.121.121', 'port': 9200}])
index = Index("filebeat-*", using=es)
search = index.search()
searchStr = "OutOfMemoryError"
search = search.query("query_string", query=searchStr)
# Restrict the results to a time window given in epoch milliseconds
search = search.filter('range', **{'@timestamp': {'gte': 1589399137000, 'lt': 1589399377000, 'format': 'epoch_millis'}})
# This is the filter that does not work
search = search.filter('term', **{'can.deployment': 'can-*'})
response = search.execute(ignore_cache=True)
print(response.hits.total)
print(response.hits.hits._source.can.deployment)
json:
filter-term - ['hits']['hits']['_source']['can']['deployment']
filter-range - ['hits']['hits']['_source']['@timestamp']
{
"hits" : {
"total" : 138351328,
"max_score" : 6.5700893,
"hits" : [
{
"_index" : "filebeat-6.1.2-2020.05.13",
"_type" : "doc",
"_score" : 2.0166037,
"_source" : {
"@timestamp" : "2020-05-13T01:14:03.354Z",
"source" : "/var/log/gw_rest/gw_rest.log",
"message" : "[2020-05-13 01:14:03.354] WARN can_gw_rest [EventLoopGroup-3-2]: An exceptionCaught() event was fired.OutOfMemoryError",
"fileset" : {...},
"can" : {
"level" : "WARN",
>>>>>>>> "message" : "An exceptionCaught() event was fired- OutOfMemoryError",
"timestamp" : "2020-05-13 01:14:03.354",
>>>>>>>> "deployment" : "can-6b721b93965b-w3we4-4074-9903"
}
}
}
]
}
}
I actually didn't need a filter(term). This worked:
dIds = response['hits']['hits'][1]['_source']['can']['deployment']
print(dIds)
# loop through the response
for hit in response['hits']['hits']:
    deployment_id = hit['_source']['can']['deployment']
    print(deployment_id)
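For readers who do need to filter on the deployment: a term query compares the value exactly and does not expand wildcards, so 'can-*' only matches a field whose value is literally can-*. A minimal sketch of a request body that uses a prefix query instead is shown below as a plain dictionary. build_search_body is a hypothetical helper, the timestamp field name is passed in by the caller, and whether the prefix matches as intended depends on how can.deployment is mapped (the assumption here is an unanalyzed, keyword-style field).

```python
def build_search_body(text, timestamp_field, gte_ms, lt_ms, deployment_prefix):
    """Build a bool query: full-text match plus range and prefix filters."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"query_string": {"query": text}},
                ],
                "filter": [
                    {"range": {timestamp_field: {
                        "gte": gte_ms,
                        "lt": lt_ms,
                        "format": "epoch_millis",
                    }}},
                    # prefix matches values that start with the given string
                    {"prefix": {"can.deployment": deployment_prefix}},
                ],
            }
        }
    }


body = build_search_body("OutOfMemoryError", "@timestamp",
                         1589399137000, 1589399377000, "can-")
print(body["query"]["bool"]["filter"][1])
```

The resulting dictionary can be sent as the body of a search request; it is kept independent of any client library so the structure is easy to inspect.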
I'm transitioning to the new AWS DocumentDB service. I'm currently on Mongo 3.2, where db.collection.distinct("FIELD_NAME") returns the results really quickly. I did a database dump to AWS DocumentDB (Mongo 3.6 compatible), and there this simple query just gets stuck.
Here's my .explain() and the indexes on the working instance versus AWS documentdb:
Explain function on working instance:
> db.collection.explain().distinct("FIELD_NAME")
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "db.collection",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "PROJECTION",
"transformBy" : {
"_id" : 0,
"FIELD_NAME" : 1
},
"inputStage" : {
"stage" : "DISTINCT_SCAN",
"keyPattern" : {
"FIELD_NAME" : 1
},
"indexName" : "FIELD_INDEX_NAME",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"FIELD_NAME" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [ ]
},
Explain on AWS documentdb, not working:
rs0:PRIMARY> db.collection.explain().distinct("FIELD_NAME")
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "db.collection",
"winningPlan" : {
"stage" : "AGGREGATE",
"inputStage" : {
"stage" : "HASH_AGGREGATE",
"inputStage" : {
"stage" : "COLLSCAN"
}
}
}
},
}
Index on both of these instances:
{
"v" : 1,
"key" : {
"FIELD_NAME" : 1
},
"name" : "FIELD_INDEX_NAME",
"ns" : "db.collection"
}
Also, this database has a couple million documents but there are only about 20 distinct values for that "FIELD_NAME". Any help would be appreciated.
I tried it with .hint("index_name") and that didn't work. I tried clearing the plan cache, but I get: Feature not supported: planCacheClear
COLLSCAN and IXSCAN don't differ much in this case: both need to scan all the documents or all the index entries.
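With only about 20 distinct values, one possible client-side workaround is to emulate a skip scan: ask for the smallest value, then repeatedly ask for the smallest value greater than the last one found. The sketch below shows the loop against an in-memory sorted list so that it is self-contained; distinct_by_skip_scan and make_sorted_list_lookup are hypothetical names. Against a real collection, find_next would be a single query that filters on FIELD_NAME greater than the last value, sorts ascending and limits to one document. Whether DocumentDB serves that query from the index is an assumption that would need checking with explain().

```python
import bisect


def distinct_by_skip_scan(find_next):
    """Collect distinct values by repeatedly fetching the next larger value."""
    values = []
    current = find_next(None)
    while current is not None:
        values.append(current)
        current = find_next(current)
    return values


def make_sorted_list_lookup(sorted_values):
    """Stand-in for an indexed collection backed by a sorted list."""
    def find_next(last):
        # Position of the first element strictly greater than `last`.
        pos = 0 if last is None else bisect.bisect_right(sorted_values, last)
        return sorted_values[pos] if pos < len(sorted_values) else None
    return find_next


data = sorted(["a", "a", "b", "b", "b", "c"])
result = distinct_by_skip_scan(make_sorted_list_lookup(data))
print(result)
```

The number of lookups is the number of distinct values plus one, regardless of how many documents the collection holds.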
I have the following DynamoDB mapping template, to update an existing DynamoDB item:
{
"version" : "2017-02-28",
"operation" : "UpdateItem",
"key" : {
"id": $util.dynamodb.toDynamoDBJson($ctx.args.application.id),
"tenant": $util.dynamodb.toDynamoDBJson($ctx.identity.claims['http://domain/tenant'])
},
"update" : {
"expression" : "SET #sourceUrl = :sourceUrl, #sourceCredential = :sourceCredential, #instanceSize = :instanceSize, #users = :users",
"expressionNames" : {
"#sourceUrl" : "sourceUrl",
"#sourceCredential" : "sourceCredential",
"#instanceSize" : "instanceSize",
"#users" : "users"
},
"expressionValues" : {
":sourceUrl" : $util.dynamodb.toDynamoDbJson($ctx.args.application.sourceUrl),
":sourceCredential" : $util.dynamodb.toDynamoDbJson($ctx.args.application.sourceCredential),
":instanceSize" : $util.dynamodb.toDynamoDbJson($ctx.args.application.instanceSize),
":users" : $util.dynamodb.toDynamoDbJson($ctx.args.application.users)
}
},
"condition" : {
"expression" : "attribute_exists(#id) AND attribute_exists(#tenant)",
"expressionNames" : {
"#id" : "id",
"#tenant" : "tenant"
}
}
}
But I'm getting the following error:
message: "Unable to parse the JSON document: 'Unrecognized token '$util': was expecting ('true', 'false' or 'null')↵ at [Source: (String)"{↵ "version" : "2017-02-28",↵ "operation" : "UpdateItem",↵ "key" : {↵ "id": {"S":"abc-123"},↵ "tenant": {"S":"test"}↵ },↵ "update" : {↵ "expression" : "SET #sourceUrl = :sourceUrl, #sourceCredential = :sourceCredential, #instanceSize = :instanceSize, #users = :users",↵ "expressionNames" : {↵ "#sourceUrl" : "sourceUrl",↵ "#sourceCredential" : "sourceCredential",↵ "#instanceSize" : "instanceSize",↵ "#users" : "users"↵ }"[truncated 400 chars]; line: 17, column: 29]'"
I've tried removing parts, and it seems to be related to the expressionValues, but I can't see anything wrong with the syntax.
It seems you misspelled the toDynamoDBJson method.
Replace
$util.dynamodb.toDynamoDbJson($ctx.args.application.sourceUrl)
with
$util.dynamodb.toDynamoDBJson($ctx.args.application.sourceUrl)
Note the uppercase B in toDynamoDBJson.
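Because the misspelling differs only in letter case, it is easy to miss by eye. A minimal sketch of a check that flags $util.dynamodb calls whose name matches a known helper only case-insensitively is given below. find_misspelled_helpers is a hypothetical helper, and KNOWN_HELPERS is deliberately a partial list that would need extending for real use.

```python
import re

# Partial list of helper names, used only for this illustration.
KNOWN_HELPERS = {"toDynamoDB", "toDynamoDBJson"}


def find_misspelled_helpers(template):
    """Return (used, expected) pairs for helpers that differ only in letter case."""
    by_lower = {name.lower(): name for name in KNOWN_HELPERS}
    problems = []
    for match in re.finditer(r"\$util\.dynamodb\.(\w+)", template):
        used = match.group(1)
        expected = by_lower.get(used.lower())
        if expected is not None and used != expected:
            problems.append((used, expected))
    return problems


template = (
    '"id": $util.dynamodb.toDynamoDBJson($ctx.args.application.id),\n'
    '":users" : $util.dynamodb.toDynamoDbJson($ctx.args.application.users)'
)
problems = find_misspelled_helpers(template)
print(problems)
```

In the template from the question, this would flag the four calls inside expressionValues and leave the two correctly spelled calls in key alone.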
I have the following query:
{
"query" : {
"bool" : {
"must" : [
{
"query_string" : {
"query" : "dog cat",
"analyzer" : "standard",
"default_operator" : "AND",
"fields" : ["title", "content"]
}
},
{
"range" : {
"dateCreate" : {
"gte" : "2018-07-01T00:00:00+0200",
"lte" : "2018-07-31T23:59:59+0200"
}
}
},
{
"regexp" : {
"articleIds" : {
"value" : ".*?(2561|30|540).*?",
"boost" : 1
}
}
}
]
}
}
}
The fields title, content and articleIds are of type text, dateCreate is of type date. The articleIds field contains some IDs (comma-separated).
OK, so what happens now? I execute the query and get two results. Both documents contain the words "dog" and "cat" in the title or in the content. So far that's correct.
But the second result has the number 3507 in the articleIds field, which doesn't match my query. It seems that the regexp is ignored because title and content already match. What is wrong here?
And here's the document that should not match my query but does:
{
"_index" : "example",
"_type" : "doc",
"_id" : "3007780",
"_score" : 21.223656,
"_source" : {
"dateCreate" : "2018-07-13T16:54:00+0200",
"title" : "",
"content" : "Its raining cats and dogs.",
"articleIds" : "3507"
}
}
What I'm expecting is that this document is not in the results, because it contains 3507, which is not one of the IDs in my query.
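To make the expected behaviour concrete, here is a small sketch in Python. It uses Python's re module, which is not the Lucene regular-expression engine that Elasticsearch uses, so it only illustrates the intent of the pattern and says nothing about how Elasticsearch evaluates it. It also shows a side effect of the pattern as written: it matches substrings, so an ID such as 1300 would be accepted because it contains 30. Comparing whole comma-separated IDs avoids that; matches_any_id is a hypothetical helper.

```python
import re

PATTERN = re.compile(r".*?(2561|30|540).*?")
WANTED = {"2561", "30", "540"}


def matches_any_id(article_ids, wanted=WANTED):
    """True if any whole comma-separated ID is one of the wanted IDs."""
    ids = {part.strip() for part in article_ids.split(",")}
    return not ids.isdisjoint(wanted)


for value in ["3507", "1300", "12,30,99"]:
    regex_hit = PATTERN.fullmatch(value) is not None
    print(value, regex_hit, matches_any_id(value))
```

Under Python's semantics, 3507 is rejected by both checks, which is the behaviour the question expects, while 1300 passes the regular expression but not the whole-ID comparison.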