AWS Elasticsearch - Query Cache - amazon-web-services

I am using AWS Elasticsearch, and the cluster receives ~600 search queries per second. This is causing periodic bursts of 503 Service Unavailable responses from Elasticsearch. As a result, I wanted to turn on the query cache for the index (I verified that it is actually turned on by looking at <ES_DOMAIN>/<INDEX_NAME>).
However, when I check the query cache stats at <ES_DOMAIN>/_stats/query_cache?pretty&human, this is what I get:
"<index_name>" : {
"primaries" : {
"query_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0,
"hit_count" : 0,
"miss_count" : 0
}
},
"total" : {
"query_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0,
"hit_count" : 0,
"miss_count" : 0
}
}
}
Any suggestions on how I can turn on the cache?

Based on my reading and similar experience (even after setting index.cache.query.enable: true in the index mapping), I can only guess that AWS has disabled query caching, probably by setting indices.cache.query.size: 0% in config/elasticsearch.yml.
UPDATE
After leaving the cluster running for a while and doing some heavy aggregations, I am seeing that the query_cache is starting to get used, although I am not sure why I am not seeing any cache hits:
GET _nodes/stats/indices/query_cache?pretty&human
{
"cluster_name": "XXXXXXXXXXXX:xxxxxxxxxxx",
"nodes": {
"q59YfHDdRQupousO9vh6KQ": {
"timestamp": 1465589579698,
"name": "Mongoose",
"indices": {
"query_cache": {
"memory_size": "37.2kb",
"memory_size_in_bytes": 38151,
"evictions": 0,
"hit_count": 0,
"miss_count": 45
}
}
},
"K3olMnkkRZW53tTw05UVhA": {
"timestamp": 1465589579692,
"name": "Meggan Braddock",
"indices": {
"query_cache": {
"memory_size": "47.3kb",
"memory_size_in_bytes": 48497,
"evictions": 0,
"hit_count": 0,
"miss_count": 53
}
}
}
}
}
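To track whether the cache ever starts producing hits, the per-node counters can be summed into one cluster-wide figure. The following is a minimal sketch that assumes the response has the shape shown above; the function name summarize_query_cache is hypothetical and the dictionary is a trimmed copy of that output.

```python
from typing import Dict


def summarize_query_cache(node_stats: Dict) -> Dict[str, float]:
    """Sum the query_cache counters over all nodes and derive a hit ratio."""
    hits = misses = size = 0
    for node in node_stats["nodes"].values():
        cache = node["indices"]["query_cache"]
        hits += cache["hit_count"]
        misses += cache["miss_count"]
        size += cache["memory_size_in_bytes"]
    lookups = hits + misses
    return {
        "hit_count": hits,
        "miss_count": misses,
        "memory_size_in_bytes": size,
        # Avoid dividing by zero when the cache has not been consulted yet.
        "hit_ratio": hits / lookups if lookups else 0.0,
    }


# Trimmed copy of the _nodes/stats output shown above.
sample = {
    "nodes": {
        "q59YfHDdRQupousO9vh6KQ": {
            "indices": {"query_cache": {
                "memory_size_in_bytes": 38151, "evictions": 0,
                "hit_count": 0, "miss_count": 45}}
        },
        "K3olMnkkRZW53tTw05UVhA": {
            "indices": {"query_cache": {
                "memory_size_in_bytes": 48497, "evictions": 0,
                "hit_count": 0, "miss_count": 53}}
        },
    }
}

summary = summarize_query_cache(sample)
print(summary)
```

Polling this periodically would show whether hit_ratio ever rises above zero once identical requests are repeated.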

Related

Mongodb db.collection.distinct() on aws documentdb doesn't use index

I am transitioning to the new AWS DocumentDB service and am currently on Mongo 3.2. When I run db.collection.distinct("FIELD_NAME") it returns the results really quickly. I did a database dump to AWS DocumentDB (Mongo 3.6 compatible) and this simple query just gets stuck.
Here's my .explain() and the indexes on the working instance versus AWS documentdb:
Explain function on working instance:
> db.collection.explain().distinct("FIELD_NAME")
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "db.collection",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "PROJECTION",
"transformBy" : {
"_id" : 0,
"FIELD_NAME" : 1
},
"inputStage" : {
"stage" : "DISTINCT_SCAN",
"keyPattern" : {
"FIELD_NAME" : 1
},
"indexName" : "FIELD_INDEX_NAME",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"FIELD_NAME" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [ ]
},
Explain on AWS documentdb, not working:
rs0:PRIMARY> db.collection.explain().distinct("FIELD_NAME")
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "db.collection",
"winningPlan" : {
"stage" : "AGGREGATE",
"inputStage" : {
"stage" : "HASH_AGGREGATE",
"inputStage" : {
"stage" : "COLLSCAN"
}
}
}
},
}
Index on both of these instances:
{
"v" : 1,
"key" : {
"FIELD_NAME" : 1
},
"name" : "FIELD_INDEX_NAME",
"ns" : "db.collection"
}
Also, this database has a couple million documents but there are only about 20 distinct values for that "FIELD_NAME". Any help would be appreciated.
I tried it with .hint("index_name") and that didn't work. I tried clearing the plan cache, but I get Feature not supported: planCacheClear.
COLLSCAN and IXSCAN don't differ much in this case: both need to scan all the documents or all the index entries.
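For contrast, the DISTINCT_SCAN stage in the first explain output does not read every index entry; it jumps from one distinct key to the next. The sketch below is only a conceptual model of that idea on a sorted Python list, not a description of MongoDB or DocumentDB internals, and both function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def distinct_by_skipping(sorted_keys: Sequence[str]) -> List[str]:
    """Collect distinct values from a sorted key list by jumping over duplicates."""
    result = []
    pos = 0
    while pos < len(sorted_keys):
        value = sorted_keys[pos]
        result.append(value)
        # Binary-search for the first entry greater than the current value,
        # so a long run of duplicates costs O(log n) instead of O(run length).
        pos = bisect_right(sorted_keys, value, lo=pos)
    return result


def distinct_by_full_scan(values: Sequence[str]) -> List[str]:
    """Collect distinct values by touching every entry once."""
    seen = set()
    for value in values:
        seen.add(value)
    return sorted(seen)


# A couple of thousand entries but only three distinct values.
keys = sorted(["a"] * 1000 + ["b"] * 1000 + ["c"] * 5)
print(distinct_by_skipping(keys))
```

With about 20 distinct values in a couple of million entries, the skipping variant performs a few dozen binary searches, while the full scan touches every entry.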

Date Range is not working in AWS ElasticSearch 6.0

I'm new to Elasticsearch and am using it to run some queries.
I want to find the products within a time window.
Here are the mapping and the query:
config = {
"settings": {
"number_of_shards" : 1,
"number_of_replicas" : 0
},
"mapping":{
"topselling":{
"properties":{
"order_id":{"type":"long","store":"yes","index":"no"},
"product":{"type":"string","store":"yes","index":"no"},
"created_at":{"type":"date","store":"yes", "format":"yyyy-MM-DD HH:mm:ss", "index":"analyzed"},
}
}
}
}
query={
"from" : 0,
"size" : 30,
"query":{
"range":{
"created_at":{
"gte":"2010-01-27 02:47:19",
"lte":"2010-01-27 23:16:59",
"format": "yyyy-MM-dd HH:mm:ss"
}
}
}
}
This is the code to search Elasticsearch:
response = es.search(index="topselling", doc_type="topselling", body=query)
I'm sure the data is in the cluster, as I can retrieve it by ID.
Thanks!

Using Elastic Search Geo Functionality To Find Most Common Locations?

I have a geojson file containing a list of locations each with a longitude, latitude and timestamp. Note the longitudes and latitudes are multiplied by 10000000.
{
"locations" : [ {
"timestampMs" : "1461820561530",
"latitudeE7" : -378107308,
"longitudeE7" : 1449654070,
"accuracy" : 35,
"junk_i_want_to_save_but_ignore" : [ { .. } ]
}, {
"timestampMs" : "1461820455813",
"latitudeE7" : -378107279,
"longitudeE7" : 1449673809,
"accuracy" : 33
}, {
"timestampMs" : "1461820281089",
"latitudeE7" : -378105184,
"longitudeE7" : 1449254023,
"accuracy" : 35
}, {
"timestampMs" : "1461820155814",
"latitudeE7" : -378177434,
"longitudeE7" : 1429653949,
"accuracy" : 34
}
..
Many of these locations will be the same physical location (e.g. the user's home) but obviously the longitude and latitudes may not be exactly the same.
I would like to use Elasticsearch and its geo functionality to produce a ranked list of the most common locations, where locations are deemed to be the same if they are within, say, 100m of each other.
For each common location I'd also like the list of all timestamps they were at that location if possible!
I'd very much appreciate a sample query to get me started!
Many thanks in advance.
In order to make it work you need to modify your mapping like this:
PUT /locations
{
"mappings": {
"location": {
"properties": {
"location": {
"type": "geo_point"
},
"timestampMs": {
"type": "long"
},
"accuracy": {
"type": "long"
}
}
}
}
}
Then, when you index your documents, you need to divide the latitude and longitude by 10000000, and index like this:
PUT /locations/location/1
{
"timestampMs": "1461820561530",
"location": {
"lat": -37.8103308,
"lon": 14.4967407
},
"accuracy": 35
}
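The division step can be sketched in a few lines of Python. This is only an illustration under the assumption that records look like the ones in the question; the helper name to_geo_document is hypothetical. Applied to the first record from the question, it yields a latitude of about -37.8107 and a longitude of about 144.9654, so any bounding box used in a query has to cover that area.

```python
E7 = 10_000_000  # the question states that coordinates are multiplied by 10000000


def to_geo_document(record: dict) -> dict:
    """Turn one raw location record into a document with a geo_point-style field."""
    return {
        "timestampMs": record["timestampMs"],
        "location": {
            "lat": record["latitudeE7"] / E7,
            "lon": record["longitudeE7"] / E7,
        },
        "accuracy": record["accuracy"],
    }


# First record from the question.
raw = {
    "timestampMs": "1461820561530",
    "latitudeE7": -378107308,
    "longitudeE7": 1449654070,
    "accuracy": 35,
}

doc = to_geo_document(raw)
print(doc)
```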
Finally, your search query below...
POST /locations/location/_search
{
"aggregations": {
"zoomedInView": {
"filter": {
"geo_bounding_box": {
"location": {
"top_left": "-37, 14",
"bottom_right": "-38, 15"
}
}
},
"aggregations": {
"zoom1": {
"geohash_grid": {
"field": "location",
"precision": 6
},
"aggs": {
"ts": {
"date_histogram": {
"field": "timestampMs",
"interval": "15m",
"format": "DDD yyyy-MM-dd HH:mm"
}
}
}
}
}
}
}
}
...will yield the following result:
{
"aggregations": {
"zoomedInView": {
"doc_count": 1,
"zoom1": {
"buckets": [
{
"key": "k362cu",
"doc_count": 1,
"ts": {
"buckets": [
{
"key_as_string": "Thu 2016-04-28 05:15",
"key": 1461820500000,
"doc_count": 1
}
]
}
}
]
}
}
}
}
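To see what the geohash_grid buckets represent, the bucketing can be reproduced client-side. The sketch below implements the standard geohash encoding and counts points per cell; it is an illustration only, not what Elasticsearch runs internally, and the function names are hypothetical. A precision of 6 gives cells of roughly 1.2 km by 0.6 km, while precision 7 is roughly 150 m, which is closer to the 100 m requirement.

```python
from collections import Counter
from typing import Iterable, List, Tuple

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a coordinate as a geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits form one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def most_common_cells(points: Iterable[Tuple[float, float]],
                      precision: int = 6, top: int = 10) -> List[Tuple[str, int]]:
    """Rank geohash cells by how many (lat, lon) points fall into them."""
    counts = Counter(geohash_encode(lat, lon, precision) for lat, lon in points)
    return counts.most_common(top)


# Three of the question's records, already divided by 10000000.
points = [
    (-37.8107308, 144.9654070),
    (-37.8107279, 144.9673809),
    (-37.8177434, 142.9653949),
]
ranked = most_common_cells(points, precision=6)
print(ranked)
```

One design caveat: two points a few metres apart can still land in different cells when they straddle a cell border, so a grid only approximates "within 100m of each other".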
UPDATE
According to our discussion, here is a solution that could work for you. Using Logstash, you can call your API and retrieve the big JSON document (using the http_poller input), extract/transform all locations and sink them to Elasticsearch (with the elasticsearch output) very easily.
Here is how it goes in order to format each event as depicted in my initial answer.
Using http_poller you can retrieve the JSON locations (note that I've set the polling interval to 1 day, but you can change that to some other value, or simply run Logstash manually each time you want to retrieve the locations)
Then we split the locations array into individual events
Then we divide the latitude/longitude fields by 10,000,000 to get proper coordinates
We also need to clean it up a bit by moving and removing some fields
Finally, we just send each event to Elasticsearch
Logstash configuration locations.conf:
input {
http_poller {
urls => {
get_locations => {
method => get
url => "http://your_api.com/locations.json"
headers => {
Accept => "application/json"
}
}
}
request_timeout => 60
# interval is given in seconds, so 1 day is 86400
interval => 86400
codec => "json"
}
}
filter {
split {
field => "locations"
}
ruby {
code => "
event['location'] = {
'lat' => event['locations']['latitudeE7'] / 10000000.0,
'lon' => event['locations']['longitudeE7'] / 10000000.0
}
"
}
mutate {
add_field => {
"timestampMs" => "%{[locations][timestampMs]}"
"accuracy" => "%{[locations][accuracy]}"
"junk_i_want_to_save_but_ignore" => "%{[locations][junk_i_want_to_save_but_ignore]}"
}
remove_field => [
"locations", "@timestamp", "@version"
]
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
index => "locations"
document_type => "location"
}
}
You can then run with the following command:
bin/logstash -f locations.conf
When that has run, you can launch your search query and you should get what you expect.

Implement auto-complete feature using MongoDB search

I have a MongoDB collection of documents of the form
{
"id": 42,
"title": "candy can",
"description": "canada candy canteen",
"brand": "cannister candid",
"manufacturer": "candle canvas"
}
I need to implement an auto-complete feature based on the input search term by matching in all fields except id. For example, if the input term is can, then I should return all matching words in the document as
{ hints: ["candy", "can", "canada", "canteen", ...] }
I looked at this question but it didn't help. I also tried searching how to do regex search in multiple fields and extract matching tokens, or extracting matching tokens in a MongoDB text search but couldn't find any help.
tl;dr
There is no easy solution for what you want, since normal queries can't modify the fields they return. There is a solution (using the below mapReduce inline instead of doing an output to a collection), but except for very small databases, it is not possible to do this in realtime.
The problem
As written, a normal query can't really modify the fields it returns. But there are other problems. If you want to do a regex search in halfway decent time, you would have to index all fields, which would need a disproportionate amount of RAM for that feature. If you didn't index all fields, a regex search would cause a collection scan, which means that every document would have to be loaded from disk, which would take too much time for autocompletion to be convenient. Furthermore, multiple simultaneous users requesting autocompletion would create considerable load on the backend.
The solution
The problem is quite similar to one I have already answered: we need to extract every word out of multiple fields, remove the stop words, and save the remaining words in a collection, together with a link to the respective document(s) each word was found in. Now, to get an autocompletion list, we simply query the indexed word list.
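Before going through the MongoDB steps, the idea can be sketched in plain Python: build a word-to-documents index once, then answer prefix lookups from the sorted word list. This is only an in-memory illustration of the concept, assuming the document shape from the question; the names build_word_index and autocomplete are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, List, Set

STOPWORDS = {"the", "this", "and", "or"}


def build_word_index(documents: Iterable[dict], id_field: str = "id") -> Dict[str, Set]:
    """Map every non-stopword found in a string field to the ids of its documents."""
    index: Dict[str, Set] = defaultdict(set)
    for doc in documents:
        for field, value in doc.items():
            # Only string fields are of interest, and never the id itself.
            if field == id_field or not isinstance(value, str):
                continue
            for word in value.lower().split():
                cleaned = word.strip(";,.")
                if cleaned and cleaned not in STOPWORDS:
                    index[cleaned].add(doc[id_field])
    return index


def autocomplete(index: Dict[str, Set], prefix: str) -> List[str]:
    """Return all indexed words that start with the given prefix, in sorted order."""
    words = sorted(index)  # a real system would keep this sorted list around
    hints = []
    for word in words[bisect_left(words, prefix):]:
        if not word.startswith(prefix):
            break
        hints.append(word)
    return hints


docs = [{
    "id": 42,
    "title": "candy can",
    "description": "canada candy canteen",
    "brand": "cannister candid",
    "manufacturer": "candle canvas",
}]

index = build_word_index(docs)
hints = autocomplete(index, "can")
print({"hints": hints})
```

The steps that follow do the same thing inside MongoDB, with the words collection playing the role of the index.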
Step 1: Use a map/reduce job to extract the words
db.yourCollection.mapReduce(
// Map function
function() {
// We need to save this in a local var as per scoping problems
var document = this;
// You need to expand this according to your needs
var stopwords = ["the","this","and","or"];
for(var prop in document) {
// We are only interested in strings and explicitly not in _id
if(prop === "_id" || typeof document[prop] !== 'string') {
continue
}
(document[prop]).split(" ").forEach(
function(word){
// You might want to adjust this to your needs
var cleaned = word.replace(/[;,.]/g,"")
if(
// We neither want stopwords...
stopwords.indexOf(cleaned) > -1 ||
// ...nor string which would evaluate to numbers
!(isNaN(parseInt(cleaned))) ||
!(isNaN(parseFloat(cleaned)))
) {
return
}
emit(cleaned,document._id)
}
)
}
},
// Reduce function
function(k,v){
// Kind of ugly, but works.
// Improvements more than welcome!
var values = { 'documents': []};
v.forEach(
function(vs){
if(values.documents.indexOf(vs)>-1){
return
}
values.documents.push(vs)
}
)
return values
},
{
// We need this for two reasons...
finalize:
function(key,reducedValue){
// First, we ensure that each resulting document
// has the documents field in order to unify access
var finalValue = {documents:[]}
// Second, we ensure that each document is unique in said field
if(reducedValue.documents) {
// We filter the existing documents array
finalValue.documents = reducedValue.documents.filter(
function(item,pos,self){
// The default return value
var loc = -1;
for(var i=0;i<self.length;i++){
// We have to do it this way since indexOf only works with primitives
if(self[i].valueOf() === item.valueOf()){
// We have found the value of the current item...
loc = i;
//... so we are done for now
break
}
}
// If the location we found equals the position of item, they are equal
// If it isn't equal, we have a duplicate
return loc === pos;
}
);
} else {
finalValue.documents.push(reducedValue)
}
// We have sanitized our data, now we can return it
return finalValue
},
// Our result are written to a collection called "words"
out: "words"
}
)
Running this mapReduce against your example would result in db.words looking like this:
{ "_id" : "can", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canada", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candid", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candle", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "candy", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "cannister", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
{ "_id" : "canvas", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
Note that the individual words are the _id of the documents. The _id field is indexed automatically by MongoDB. Since MongoDB tries to keep indexes in RAM, we can do a few tricks to both speed up autocompletion and reduce the load put on the server.
Step 2: Query for autocompletion
For autocompletion, we only need the words, without the links to the documents.
Since the words are indexed, we use a covered query – a query answered only from the index, which usually resides in RAM.
To stick with your example, we would use the following query to get the candidates for autocompletion:
db.words.find({_id:/^can/},{_id:1})
which gives us the result
{ "_id" : "can" }
{ "_id" : "canada" }
{ "_id" : "candid" }
{ "_id" : "candle" }
{ "_id" : "candy" }
{ "_id" : "cannister" }
{ "_id" : "canteen" }
{ "_id" : "canvas" }
Using the .explain() method, we can verify that this query uses only the index.
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 8,
"nscannedObjects" : 0,
"nscanned" : 8,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 8,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"_id" : [
[
"can",
"cao"
],
[
/^can/,
/^can/
]
]
},
"server" : "32a63f87666f:27017",
"filterSet" : false
}
Note the indexOnly:true field.
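The indexBounds entry ["can", "cao"] shows how an anchored prefix regex turns into a range scan: every string starting with "can" sorts at or after "can" and before "cao". A minimal sketch of that translation on a sorted list follows; it illustrates the idea only, the function names are hypothetical, and it assumes the last character of the prefix is not the highest possible code point.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def prefix_bounds(prefix: str) -> Tuple[str, str]:
    """Return (lower, upper) so that lower <= s < upper for every s starting with prefix."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    # Bump the last character by one code point to get the exclusive upper bound.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper


def prefix_scan(sorted_keys: Sequence[str], prefix: str) -> List[str]:
    """Return all keys with the given prefix using two binary searches."""
    lower, upper = prefix_bounds(prefix)
    start = bisect_left(sorted_keys, lower)
    end = bisect_left(sorted_keys, upper)
    return list(sorted_keys[start:end])


keys = ["bar", "can", "canada", "cap", "car"]
print(prefix_bounds("can"))
print(prefix_scan(keys, "can"))
```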
Step 3: Query the actual document
Although we have to run two queries to get the actual document, the overall process is faster, so the user experience should be good enough.
Step 3.1: Get the document of the words collection
When the user selects one of the autocompletion choices, we query the complete document from words to find the documents in which the chosen word originated.
db.words.find({_id:"canteen"})
which would result in a document like this:
{ "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
Step 3.2: Get the actual document
With that document, we can now either show a page with search results or, like in this case, redirect to the actual document which you can get by:
db.yourCollection.find({_id:ObjectId("553e435f20e6afc4b8aa0efb")})
Notes
While this approach may seem complicated at first (well, the mapReduce is a bit), it is actually pretty easy conceptually. Basically, you are trading real-time results (which you won't have anyway unless you spend a lot of RAM) for speed. Imho, that's a good deal. To make the rather costly mapReduce phase more efficient, implementing an incremental mapReduce could be one approach; improving my admittedly hacked mapReduce might well be another.
Last but not least, this way is a rather ugly hack altogether. You might want to dig into Elasticsearch or Lucene. Those products imho are much, much more suited for what you want.
Thanks to @Markus's solution, I came up with something similar using aggregations instead, since map-reduce is flagged as deprecated in later versions.
const { MongoDBNamespace, Collection } = require('mongodb')
//.replace(/(\b(\w{1,3})\b(\W|$))/g,'').split(/\s+/).join(' ')
const routine = `function (text) {
const stopwords = ['the', 'this', 'and', 'or', 'id']
text = text.replace(new RegExp('\\b(' + stopwords.join('|') + ')\\b', 'g'), '')
text = text.replace(/[;,.]/g, ' ').trim()
return text.toLowerCase()
}`
// If the pipeline includes the $out operator, aggregate() returns an empty cursor.
const agg = [
{
$match: {
a: true,
d: false,
},
},
{
$project: {
title: 1,
desc: 1,
},
},
{
$replaceWith: {
_id: '$_id',
text: {
$concat: ['$title', ' ', '$desc'],
},
},
},
{
$addFields: {
cleaned: {
$function: {
body: routine,
args: ['$text'],
lang: 'js',
},
},
},
},
{
$replaceWith: {
_id: '$_id',
text: {
$trim: {
input: '$cleaned',
},
},
},
},
{
$project: {
words: {
$split: ['$text', ' '],
},
qt: {
$literal: 1,
},
},
},
{
$unwind: {
path: '$words',
includeArrayIndex: 'id',
preserveNullAndEmptyArrays: true,
},
},
{
$group: {
_id: '$words',
docs: {
$addToSet: '$_id',
},
weight: {
$sum: '$qt',
},
},
},
{
$sort: {
weight: -1,
},
},
{
$limit: 100,
},
{
$out: {
db: 'listings_db',
coll: 'words',
},
},
]
// Closure for db instance only
/**
*
* @param { MongoDBNamespace } db
*/
module.exports = function (db) {
/** @type { Collection } */
let collection
/**
* Runs the aggregation pipeline
* @return {Promise}
*/
this.refreshKeywords = async function () {
collection = db.collection('listing')
// .toArray() to trigger the aggregation
// it returns an empty cursor so it's fine
return await collection.aggregate(agg).toArray()
}
}
Adapt it to your own schema; only minimal changes should be needed.

Mongo regex index matching - multiple start strings

I want to match multiple start strings in mongo. explain() shows that it's using the indexedfield index for this query:
db.mycol.find({indexedfield:/^startstring/,nonindexedfield:/somesubstring/});
However, the following query for multiple start strings is really slow. When I run explain I get an error. Judging by the faults I can see in mongostat (7k a second) it's scanning the entire collection. It's also alternating between 0% locked and 90-95% locked every few seconds.
db.mycol.find({indexedfield:/^(startstring1|startstring2)/,nonindexedfield:/somesubstring/}).explain();
JavaScript execution failed: error: { "$err" : "assertion src/mongo/db/key.cpp:421" } at src/mongo/shell/query.js:L128
Can anyone shed some light on how I can do this or what is causing the explain error?
UPDATE - more info
Ok, so I managed to get explain to work on the more complex query by limiting the number of results. The difference is this:
For a single prefix, /^BA1/ (yes, these are postcodes):
"cursor" : "BtreeCursor pc_1 multi",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 10,
"nscanned" : 10,
"nscannedObjectsAllPlans" : 19,
"nscannedAllPlans" : 19,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"indexedfield" : [
[
"BA1",
"BA2"
],
[
/^BA1/,
/^BA1/
]
]
}
For multiple prefixes, /^(BA1|BA2)/:
"cursor" : "BtreeCursor pc_1 multi",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 10,
"nscanned" : 1075276,
"nscannedObjectsAllPlans" : 1075285,
"nscannedAllPlans" : 2150551,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 5,
"nChunkSkips" : 0,
"millis" : 4596,
"indexBounds" : {
"indexedfield" : [
[
"",
{
}
],
[
/^(BA1|BA2)/,
/^(BA1|BA2)/
]
]
}
which doesn't look very good.
$or solves the problem in terms of using the indexes (thanks EddieJamsession). Queries are now lightning fast.
db.mycol.find({$or: [{indexedfield:/^startstring1/},{indexedfield:/^startstring2/}],nonindexedfield:/somesubstring/})
However, I would still like to do this with a regex if possible so I'm leaving the question open. Not least because I now have to refactor my application to take these types of queries into account.
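If the refactoring is done on the application side, the $or filter can be generated from a list of prefixes so that callers do not need to assemble it by hand. The following is a minimal sketch that only builds the filter document; the helper name prefix_or_query is hypothetical, and re.escape follows Python's escaping rules, which is assumed here to be close enough to the server's regex dialect for plain prefixes such as postcodes.

```python
import re
from typing import Dict, Iterable, Optional


def prefix_or_query(indexed_field: str, prefixes: Iterable[str],
                    extra: Optional[Dict] = None) -> Dict:
    """Build a filter with one anchored prefix regex per $or branch."""
    query: Dict = {
        "$or": [
            # Each branch is a simple anchored prefix, which can be turned
            # into its own tight index range.
            {indexed_field: {"$regex": "^" + re.escape(prefix)}}
            for prefix in prefixes
        ]
    }
    if extra:
        # Additional conditions are ANDed with the $or clause.
        query.update(extra)
    return query


query = prefix_or_query(
    "indexedfield",
    ["BA1", "BA2"],
    extra={"nonindexedfield": {"$regex": "somesubstring"}},
)
print(query)
```

The resulting dictionary could be passed as the filter argument to a driver call such as PyMongo's Collection.find.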