Basic geosearch with ElasticSearch - amazon-web-services

I'm putting together a proof of concept on AWS using Dynamo and the Amazon ElasticSearch service, and I'm having some trouble getting
I've checked the ES Dashboard and see the following....
I have an index [assets] and a mapping [asset_types]. Below is a sample of some of the mappings, with the relevant location
filename *string*
checksum *string*
added_date *date*
General [this is a map]
location
lat *string*
lon *string*
make *string*
model *string*
I want the geo searches to be on the "General.location" field. I've tried a couple different queries so far without any luck, but I'm sure I'm missing something rather obvious.
One is from the official documentation here, modified to the below, which results in this error:
"reason": "failed to parse search source. unknown search element [bool]",
POST assets/_search
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "200km",
"General.location" : {
"lat" : 40,
"lon" : -70
}
}
}
}
}
I've also tried a slightly different query, which raises "reason": "failed to find geo_point field [General.location]"
POST assets/_search
{
"filter" : {
"geo_distance" : {
"distance" : "1km",
"General.location" : {
"lat" : 40,
"lon" : -70
}
}
},
"query" : {
"match_all" : {}
}
}
Am I running a query incorrectly? Do I need to update the mapping in the index to specify the geo-index? I thought if I formatted fields properly that wasn't a requirement.
Thanks

The issue lies in your mapping: your General.location field is not mapped as a geo_point (its lat and lon are mapped as plain strings). That's the reason you get the error failed to find geo_point field.
So instead of
General [this is a map]
location
lat *string*
lon *string*
You need to have
General [this is a map]
location *geo_point*
So you need to modify your mapping accordingly and reindex your data.
The second issue you have is that your first query needs to be enclosed in a query section:
POST assets/_search
{
"query" : {
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "200km",
"General.location" : {
"lat" : 40,
"lon" : -70
}
}
}
}
}
}
Once you've fixed both issues, you'll be able to run your query successfully.
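If the request body is assembled in application code, the nesting above is easy to get wrong again. Here is a minimal sketch in plain JavaScript that builds the same body; buildGeoDistanceSearch is a hypothetical helper name of mine, not part of any Elasticsearch client, and it only constructs the JSON, it does not send it.

```javascript
// Hypothetical helper (an assumption, not an Elasticsearch API):
// builds the search body with the bool query wrapped in "query".
function buildGeoDistanceSearch(field, lat, lon, distance) {
  return {
    query: {
      bool: {
        must: { match_all: {} },
        filter: {
          geo_distance: {
            distance: distance,
            // The geo_point field name is used as the key, e.g. "General.location"
            [field]: { lat: lat, lon: lon },
          },
        },
      },
    },
  };
}

const body = buildGeoDistanceSearch('General.location', 40, -70, '200km');
console.log(JSON.stringify(body, null, 2));
```

The printed JSON can be pasted as the body of POST assets/_search once the field is mapped as a geo_point.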

In addition to what Val said, I explicitly created a new mapping for the location field. Note for other novices out there: I needed to use a nested properties update in order to create "General.deviceLocation". After I did this, Val's updated query worked.
PUT assets/_mapping/assets_type
{
"properties": {
"General": {
"properties": {
"deviceLocation": {
"type": "geo_point"
}
}
}
}
}
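Since the existing documents stored lat and lon as strings, a reindexing script may want to normalize them before writing to the new geo_point field. A rough sketch in plain JavaScript, assuming each source document carries an object with string lat and lon values; toGeoPoint is a hypothetical helper, and the range checks are my own addition.

```javascript
// Hypothetical helper (an assumption): converts { lat: "40.12", lon: "-70.5" }
// into the numeric { lat, lon } object form that a geo_point field accepts.
function toGeoPoint(location) {
  const lat = Number.parseFloat(location.lat);
  const lon = Number.parseFloat(location.lon);
  // Reject values that are not numbers or fall outside valid coordinates
  if (!Number.isFinite(lat) || lat < -90 || lat > 90) {
    throw new RangeError('invalid latitude: ' + location.lat);
  }
  if (!Number.isFinite(lon) || lon < -180 || lon > 180) {
    throw new RangeError('invalid longitude: ' + location.lon);
  }
  return { lat: lat, lon: lon };
}

const point = toGeoPoint({ lat: '40.12', lon: '-70.5' });
console.log(point);
```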

Related

Is there an node like statement available in mongodb

I want to filter on the decimal values in a nested object of a JSON document. In the sample code below, I want to apply a LIKE-style match on the field names to get the values under keys such as t1 and t2 in the sample document.
Sample code:
db.getCollection('temp').find({},{"temp.text./.*t.*/.value":1})
Sample Json file:
{
"_id" :0"),
"temp" : {
"text" : {
"t1" : {
"value" : "960"
},
"t2" : {
"value" : "959"
},
"t3" : {
"value" : "961"
},
"t4" : {
"value" : "962"
},
"t5" : {
"value" : "6.0"
}
}
}
}
MongoDB doesn't have a way to filter field names directly other than projection, which is exact match only.
However, using aggregation you can use $objectToArray, which would convert the object {"t1" : {"value" : "960"}} to [{"k":"t1","v":{"value":"960"}}]. You can then filter based on the value of k, and use $arrayToObject to convert the entries left back into an object.
db.getCollection('temp').aggregate([
{$addFields:{
"temp.text":{
$arrayToObject:{
$filter:{
input:{$objectToArray:"$temp.text"},
cond:{
$regexMatch:{
input:"$$this.k",
regex:/t/
}
}
}
}
}
}}
])
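To see what the three operators do to a single document, the same steps can be mimicked in plain JavaScript. This is only an illustration of the data flow, not MongoDB code; filterKeys is a hypothetical name, and the pattern /^t[12]$/ is my assumption for "keys like t1 and t2".

```javascript
// Hypothetical helper (an assumption): keeps only the keys matching a pattern.
function filterKeys(obj, pattern) {
  // Like $objectToArray: { t1: {...} } becomes [{ k: "t1", v: {...} }]
  const entries = Object.entries(obj).map(([k, v]) => ({ k: k, v: v }));
  // Like $filter with $regexMatch on the key
  const kept = entries.filter((entry) => pattern.test(entry.k));
  // Like $arrayToObject: back to a plain object
  return Object.fromEntries(kept.map((entry) => [entry.k, entry.v]));
}

const text = {
  t1: { value: '960' },
  t2: { value: '959' },
  t3: { value: '961' },
  t5: { value: '6.0' },
};
const filtered = filterKeys(text, /^t[12]$/);
console.log(filtered);
```

In the pipeline itself, the same narrowing would come from changing the regex passed to $regexMatch.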

Ordering a term aggregation with a multi-bucket sub-aggregation

Given a term aggregation (label), I would like to sort the bucket by a string field (energy).
The problem is that we cannot use a multi-bucket value in the order clause.
For a given label, I'm sure that there is only one energy. What I would like to do is to use the first (and only) result of my energy sub aggregation.
I'm using the AWS Elasticsearch service, which runs version 1.5 with scripts disabled, so I have not found a way to sort the buckets by another term :(
Any idea ?
{
"aggs" : {
"label" : {
"terms" : { "field" : "label" },
"order" : { "energy[0]" : "desc" } // cannot do this
},
"aggs" : {
"energy" : {
"terms" : {
"field" : "energy",
"size" : 1
}
}
}
}
}

Implement auto-complete feature using MongoDB search

I have a MongoDB collection of documents of the form
{
"id": 42,
"title": "candy can",
"description": "canada candy canteen",
"brand": "cannister candid",
"manufacturer": "candle canvas"
}
I need to implement an auto-complete feature that matches the input search term against all fields except id. For example, if the input term is can, then I should return all matching words in the document as
{ hints: ["candy", "can", "canada", "canteen", ...] }
I looked at this question but it didn't help. I also tried searching how to do regex search in multiple fields and extract matching tokens, or extracting matching tokens in a MongoDB text search but couldn't find any help.
tl;dr
There is no easy solution for what you want, since normal queries can't modify the fields they return. There is a solution (using the below mapReduce inline instead of doing an output to a collection), but except for very small databases, it is not possible to do this in realtime.
The problem
As written, a normal query can't really modify the fields it returns. But there are other problems. If you want to do a regex search in halfway decent time, you would have to index all fields, which would need a disproportionate amount of RAM for that feature. If you didn't index all fields, a regex search would cause a collection scan, which means that every document would have to be loaded from disk, which would take too much time for autocompletion to be convenient. Furthermore, multiple simultaneous users requesting autocompletion would create considerable load on the backend.
The solution
The problem is quite similar to one I have already answered: we need to extract every word out of multiple fields, remove the stop words, and save the remaining words in a collection, together with a link to the respective document(s) in which each word was found. Then, to get an autocompletion list, we simply query the indexed word list.
Step 1: Use a map/reduce job to extract the words
db.yourCollection.mapReduce(
// Map function
function() {
// We need to save this in a local var because of scoping problems
var document = this;
// You need to expand this according to your needs
var stopwords = ["the","this","and","or"];
for(var prop in document) {
// We are only interested in strings and explicitly not in _id
if(prop === "_id" || typeof document[prop] !== 'string') {
continue
}
(document[prop]).split(" ").forEach(
function(word){
// You might want to adjust this to your needs
var cleaned = word.replace(/[;,.]/g,"")
if(
// We neither want stopwords...
stopwords.indexOf(cleaned) > -1 ||
// ...nor string which would evaluate to numbers
!(isNaN(parseInt(cleaned))) ||
!(isNaN(parseFloat(cleaned)))
) {
return
}
emit(cleaned,document._id)
}
)
}
},
// Reduce function
function(k,v){
// Kind of ugly, but works.
// Improvements more than welcome!
var values = { 'documents': []};
v.forEach(
function(vs){
if(values.documents.indexOf(vs)>-1){
return
}
values.documents.push(vs)
}
)
return values
},
{
// We need this for two reasons...
finalize:
function(key,reducedValue){
// First, we ensure that each resulting document
// has the documents field in order to unify access
var finalValue = {documents:[]}
// Second, we ensure that each document is unique in said field
if(reducedValue.documents) {
// We filter the existing documents array
finalValue.documents = reducedValue.documents.filter(
function(item,pos,self){
// The default return value
var loc = -1;
for(var i=0;i<self.length;i++){
// We have to do it this way since indexOf only works with primitives
if(self[i].valueOf() === item.valueOf()){
// We have found the value of the current item...
loc = i;
//... so we are done for now
break
}
}
// If the location we found equals the position of item, they are equal
// If it isn't equal, we have a duplicate
return loc === pos;
}
);
} else {
finalValue.documents.push(reducedValue)
}
// We have sanitized our data, now we can return it
return finalValue
},
// Our result are written to a collection called "words"
out: "words"
}
)
Running this mapReduce against your example would result in db.words looking like this:
{ "_id" : "can", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canada", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candid", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candle", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candy", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "cannister", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canvas", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
Note that the individual words are the _id of the documents. The _id field is indexed automatically by MongoDB. Since MongoDB tries to keep indices in RAM, we can do a few tricks to both speed up autocompletion and reduce the load put on the server.
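The word extraction done by the map function can be tried outside MongoDB on a single document. The following is a rough plain-JavaScript sketch of that step only (no reduce, no document links); extractWords is a hypothetical name, and skipping empty strings is a small addition of mine.

```javascript
const STOPWORDS = ['the', 'this', 'and', 'or'];

// Hypothetical helper (an assumption): mirrors the map function's filtering
// and returns the unique words of one document, sorted.
function extractWords(doc) {
  const words = new Set();
  for (const [prop, value] of Object.entries(doc)) {
    // Only string fields are of interest, and explicitly not _id
    if (prop === '_id' || typeof value !== 'string') continue;
    for (const word of value.split(' ')) {
      const cleaned = word.replace(/[;,.]/g, '');
      const isNumeric =
        !Number.isNaN(parseInt(cleaned, 10)) || !Number.isNaN(parseFloat(cleaned));
      if (cleaned === '' || STOPWORDS.includes(cleaned) || isNumeric) continue;
      words.add(cleaned);
    }
  }
  return Array.from(words).sort();
}

const sample = {
  id: 42,
  title: 'candy can',
  description: 'canada candy canteen',
  brand: 'cannister candid',
  manufacturer: 'candle canvas',
};
console.log(extractWords(sample));
```

For the sample document this yields the same eight words that appear as _id values in db.words.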
Step 2: Query for autocompletion
For autocompletion, we only need the words, without the links to the documents.
Since the words are indexed, we use a covered query – a query answered only from the index, which usually resides in RAM.
To stick with your example, we would use the following query to get the candidates for autocompletion:
db.words.find({_id:/^can/},{_id:1})
which gives us the result
{ "_id" : "can" }
{ "_id" : "canada" }
{ "_id" : "candid" }
{ "_id" : "candle" }
{ "_id" : "candy" }
{ "_id" : "cannister" }
{ "_id" : "canteen" }
{ "_id" : "canvas" }
Using the .explain() method, we can verify that this query uses only the index.
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 8,
"nscannedObjects" : 0,
"nscanned" : 8,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 8,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"_id" : [
[
"can",
"cao"
],
[
/^can/,
/^can/
]
]
},
"server" : "32a63f87666f:27017",
"filterSet" : false
}
Note the indexOnly:true field.
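The index bounds ["can", "cao"] in the explain output suggest how the anchored regex /^can/ is answered: as a range scan over the sorted _id index, from "can" up to (but excluding) the next possible prefix "cao". A small sketch of that idea in plain JavaScript, with a sorted array standing in for the index; both function names are hypothetical, and the sketch assumes a non-empty prefix whose last character is not the highest possible code unit.

```javascript
// Hypothetical helper (an assumption): smallest string greater than every
// string starting with the prefix, obtained by bumping the last character.
function prefixUpperBound(prefix) {
  if (prefix.length === 0) {
    throw new Error('prefix must not be empty');
  }
  const last = prefix.charCodeAt(prefix.length - 1);
  return prefix.slice(0, -1) + String.fromCharCode(last + 1);
}

// Hypothetical helper (an assumption): range scan over a sorted word list.
function prefixScan(sortedWords, prefix) {
  const upper = prefixUpperBound(prefix);
  return sortedWords.filter((word) => word >= prefix && word < upper);
}

const indexedWords = ['apple', 'can', 'canada', 'canvas', 'cao', 'cat'];
console.log(prefixUpperBound('can'));
console.log(prefixScan(indexedWords, 'can'));
```

This is also why the regex should stay anchored with ^: an unanchored pattern such as /can/ cannot be turned into a single range like this.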
Step 3: Query the actual document
Although we have to run two queries to get the actual document, the overall process is faster, so the user experience should be good enough.
Step 3.1: Get the document of the words collection
When the user selects one of the autocompletion suggestions, we have to fetch the complete document from the words collection in order to find the documents the chosen word originated from.
db.words.find({_id:"canteen"})
which would result in a document like this:
{ "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
Step 3.2: Get the actual document
With that document, we can now either show a page with search results or, like in this case, redirect to the actual document which you can get by:
db.yourCollection.find({_id:ObjectId("553e435f20e6afc4b8aa0efb")})
Notes
While this approach may seem complicated at first (well, the mapReduce is a bit), it is actually pretty easy conceptually. Basically, you are trading real-time results (which you won't have anyway unless you spend a lot of RAM) for speed. Imho, that's a good deal. To make the rather costly mapReduce phase more efficient, implementing Incremental mapReduce could be an approach; improving my admittedly hacked mapReduce might well be another.
Last but not least, this way is a rather ugly hack altogether. You might want to dig into elasticsearch or lucene. Those products imho are much, much more suited for what you want.
Thanks to @Markus's solution, I came up with something similar using aggregations instead, since map-reduce is flagged as deprecated in later versions.
const { MongoDBNamespace, Collection } = require('mongodb')
//.replace(/(\b(\w{1,3})\b(\W|$))/g,'').split(/\s+/).join(' ')
const routine = `function (text) {
const stopwords = ['the', 'this', 'and', 'or', 'id']
text = text.replace(new RegExp('\\b(' + stopwords.join('|') + ')\\b', 'g'), '')
text = text.replace(/[;,.]/g, ' ').trim()
return text.toLowerCase()
}`
// If the pipeline includes the $out operator, aggregate() returns an empty cursor.
const agg = [
{
$match: {
a: true,
d: false,
},
},
{
$project: {
title: 1,
desc: 1,
},
},
{
$replaceWith: {
_id: '$_id',
text: {
$concat: ['$title', ' ', '$desc'],
},
},
},
{
$addFields: {
cleaned: {
$function: {
body: routine,
args: ['$text'],
lang: 'js',
},
},
},
},
{
$replaceWith: {
_id: '$_id',
text: {
$trim: {
input: '$cleaned',
},
},
},
},
{
$project: {
words: {
$split: ['$text', ' '],
},
qt: {
$const: 1,
},
},
},
{
$unwind: {
path: '$words',
includeArrayIndex: 'id',
preserveNullAndEmptyArrays: true,
},
},
{
$group: {
_id: '$words',
docs: {
$addToSet: '$_id',
},
weight: {
$sum: '$qt',
},
},
},
{
$sort: {
weight: -1,
},
},
{
$limit: 100,
},
{
$out: {
db: 'listings_db',
coll: 'words',
},
},
]
// Closure for db instance only
/**
*
* @param { MongoDBNamespace } db
*/
module.exports = function (db) {
/** @type { Collection } */
let collection
/**
* Runs the aggregation pipeline
* @return {Promise}
*/
this.refreshKeywords = async function () {
collection = db.collection('listing')
// .toArray() to trigger the aggregation
// it returns an empty cursor so it's fine
return await collection.aggregate(agg).toArray()
}
}
Please check for very minimal changes for your convenience.
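To make the $split, $unwind and $group stages easier to follow, here is a rough plain-JavaScript imitation of what they compute for a couple of documents. It is not the pipeline itself: buildKeywordIndex is a hypothetical name, the sample documents are made up, stopword removal is left out, and empty words are skipped, which the pipeline above does not do.

```javascript
// Hypothetical helper (an assumption): groups words across documents,
// collecting the source ids and counting occurrences per word.
function buildKeywordIndex(docs) {
  const byWord = new Map();
  for (const doc of docs) {
    const text = (doc.title + ' ' + doc.desc).toLowerCase().trim();
    // $split followed by $unwind: one step per word
    for (const word of text.split(' ')) {
      if (word === '') continue;
      const entry = byWord.get(word) || { _id: word, docs: new Set(), weight: 0 };
      entry.docs.add(doc._id); // like $addToSet
      entry.weight += 1; // like $sum over the constant qt field
      byWord.set(word, entry);
    }
  }
  // like $sort by weight, descending
  return Array.from(byWord.values())
    .map((e) => ({ _id: e._id, docs: Array.from(e.docs), weight: e.weight }))
    .sort((a, b) => b.weight - a.weight);
}

const keywordIndex = buildKeywordIndex([
  { _id: 1, title: 'candy can', desc: 'canada candy canteen' },
  { _id: 2, title: 'Candy', desc: 'candle canvas' },
]);
console.log(keywordIndex[0]);
```

The weight gives a simple way to rank suggestions, which the mapReduce version does not provide.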

MongoDB Query For Fields That Vary - Wildcards?

I am looking for a way to get distinct "unit" values from a collection that has a structure similar to the following:
{
"_id" : ObjectId("548b1aee6e444414f00d5cf1"),
"KPI" : {
"NPV" : {
"value" : 100,
"unit" : "kUSD"
},
"NPM" : {
"value" : 100,
"unit" : "kUSD"
},
"GPM" : {
"value" : 50,
"unit" : "CAD"
}
}
}
I looked into using wildcards and regex but from what I have come across this is not supported for field matching. I would like to do something like db.collection.distinct('KPI.*.unit') but cannot determine how and it seems like performance would be poor. Does anyone have a recommendation? Thanks.
It's not a good practice to make the keys a part of the content of the document - don't use keys as data. If you don't change your document structure, you'll need to know what the possible subfields of KPI are. If you don't know what those could be, you will need to examine the documents manually to find them. Then you can issue a distinct for each using dot notation, e.g. db.collection.distinct("KPI.NPM.unit").
If what you're looking for instead is the distinct values of unit across all values of the parent KPI subfield, then you could take the union of all of the results of the distincts. You can also do it easily with the aggregation framework in MongoDB 2.6. For simplicity, I'll assume there are just three distinct subfields of KPI, the ones in the document above.
db.collection.aggregate([
{ "$group" : { "_id" : 0, "NPVunits" : { "$addToSet" : "$KPI.NPV.unit" }, "NPMunits" : { "$addToSet" : "$KPI.NPM.unit" }, "GPMunits" : { "$addToSet" : "$KPI.GPM.unit" } } },
{ "$project" : { "distinct_units" : { "$setUnion" : ["$NPVunits", "$NPMunits", "$GPMunits"] } } }
])
You could also structure your data as dynamic attributes. The document above would be recast as something like
{
"_id" : ObjectId("548b1aee6e444414f00d5cf1"),
"KPI" : [
{ "type" : "NPV", "value" : 100, "unit" : "kUSD" },
{ "type" : "NPM", "value" : 100, "unit" : "kUSD" },
{ "type" : "GPM", "value" : 50, "unit" : "CAD" }
]
}
Querying for distinct units is easy now, whether you want it per type or over all types:
Per type (all types in one query)
db.collection.aggregate([
{ "$unwind" : "$KPI" },
{ "$group" : { "_id" : "$KPI.type", "units" : { "$addToSet" : "$KPI.unit" } } }
])
Over all types
db.collection.distinct("KPI.unit")
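Recasting existing documents into the dynamic-attribute shape could be done with a small migration helper. A minimal sketch in plain JavaScript, run on an in-memory copy of the sample document; toDynamicAttributes and distinctUnits are hypothetical names, and the string _id is a stand-in for the real ObjectId.

```javascript
// Hypothetical helper (an assumption): turns KPI: { NPV: {...}, ... }
// into KPI: [{ type: "NPV", ... }, ...].
function toDynamicAttributes(doc) {
  return {
    _id: doc._id,
    KPI: Object.entries(doc.KPI).map(([type, kpi]) => ({
      type: type,
      value: kpi.value,
      unit: kpi.unit,
    })),
  };
}

// Hypothetical helper (an assumption): client-side counterpart of
// distinct("KPI.unit") over already recast documents.
function distinctUnits(docs) {
  const units = new Set();
  for (const doc of docs) {
    for (const kpi of doc.KPI) units.add(kpi.unit);
  }
  return Array.from(units);
}

const original = {
  _id: 'doc-1', // placeholder for the ObjectId in the sample
  KPI: {
    NPV: { value: 100, unit: 'kUSD' },
    NPM: { value: 100, unit: 'kUSD' },
    GPM: { value: 50, unit: 'CAD' },
  },
};
const recast = toDynamicAttributes(original);
console.log(recast.KPI);
console.log(distinctUnits([recast]));
```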

combining regex and embedded objects in mongodb queries

I am trying to combine regex and embedded object queries and failing miserably. I am either hitting a limitation of MongoDB or just getting something slightly wrong; maybe someone out there has encountered this. The documentation certainly doesn't cover this case.
data being queried:
{
"_id" : ObjectId("4f94fe633004c1ef4d892314"),
"productname" : "lightbulb",
"availability" : [
{
"country" : "USA",
"storeCode" : "abc-1234"
},
{
"country" : "USA",
"storeCode" : "xzy-6784"
},
{
"country" : "USA",
"storeCode" : "abc-3454"
},
{
"country" : "CANADA",
"storeCode" : "abc-6845"
}
]
}
Assume the collection contains only one record.
This query returns 1:
db.testCol.find({"availability":{"country" : "USA","storeCode":"xzy-6784"}}).count();
This query returns 1:
db.testCol.find({"availability.storeCode":/.*/}).count();
But, this query returns 0:
db.testCol.find({"availability":{"country" : "USA","storeCode":/.*/}}).count();
Does anyone understand why? Is this a bug?
thanks
You are referencing the embedded storeCode incorrectly: you are querying availability as if it were a single embedded object, when in fact it is an array of objects. Compare these results:
db.testCol.find({"availability.0.storeCode":/x/});
db.testCol.find({"availability.0.storeCode":/a/});
Using your sample doc above, the first one will not return anything, because the first storeCode ("abc-1234") does not have an x in it; the second will return the document. That's fine for the case where you are looking at a single element of the array and pass in its position. In order to search all of the objects in the array, you want $elemMatch.
As an example, I added this second example doc:
{
"_id" : ObjectId("4f94fe633004c1ef4d892315"),
"productname" : "hammer",
"availability" : [
{
"country" : "USA",
"storeCode" : "abc-1234"
}
]
}
Now, have a look at the results of these queries:
PRIMARY> db.testCol.find({"availability" : {$elemMatch : {"storeCode":/a/}}}).count();
2
PRIMARY> db.testCol.find({"availability" : {$elemMatch : {"storeCode":/x/}}}).count();
1
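As far as I understand it, the original query returned 0 because matching an array against a whole embedded document is an exact-equality comparison, so the regex is compared as a value rather than applied as a pattern. With $elemMatch each condition is evaluated per field on each array element. The sketch below imitates that per-element matching in plain JavaScript for string and regex conditions only; the function names are hypothetical and this is not how MongoDB is implemented.

```javascript
// Hypothetical helpers (assumptions): a tiny imitation of $elemMatch.
function matchesCondition(value, condition) {
  if (condition instanceof RegExp) {
    return typeof value === 'string' && condition.test(value);
  }
  return value === condition;
}

// True if at least one array element satisfies ALL conditions at once
function elemMatch(array, conditions) {
  return array.some((element) =>
    Object.entries(conditions).every(([field, condition]) =>
      matchesCondition(element[field], condition)
    )
  );
}

function countElemMatch(docs, arrayField, conditions) {
  return docs.filter((doc) => elemMatch(doc[arrayField], conditions)).length;
}

const docs = [
  {
    productname: 'lightbulb',
    availability: [
      { country: 'USA', storeCode: 'abc-1234' },
      { country: 'USA', storeCode: 'xzy-6784' },
      { country: 'USA', storeCode: 'abc-3454' },
      { country: 'CANADA', storeCode: 'abc-6845' },
    ],
  },
  { productname: 'hammer', availability: [{ country: 'USA', storeCode: 'abc-1234' }] },
];

console.log(countElemMatch(docs, 'availability', { storeCode: /a/ }));
console.log(countElemMatch(docs, 'availability', { storeCode: /x/ }));
console.log(countElemMatch(docs, 'availability', { country: 'USA', storeCode: /.*/ }));
```

The last call corresponds to rewriting the failing query with $elemMatch around both the country and the storeCode conditions.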