AWS DynamoDB geospatial query using ElasticSearch

I am storing transactions tagged with geo-coordinates in a DynamoDB table. I want to do a geo-spatial query to find all transactions within e.g. 10 miles distance from an input pair of coordinates.
I saw here that I could perform such a geo-spatial query using AWS ElasticSearch. However, I am not sure if I want to pay the hourly fee for the service at this time if that is the only purpose I will use it for.
An alternative I thought of is to keep only 4 digits after the decimal point of each coordinate when storing, and then read all the transactions that have the same truncated coordinates, since they would essentially fall within roughly the same 100~200 m^2 area. This isn't a very good solution in terms of accuracy and range.
Any suggestion to a better alternative for such a geo-spatial query or on whether ElasticSearch would be a worthy investment based on time/cost?

You could consider using "Geo Library for Amazon DynamoDB". Features include Radius Queries: "Return all of the items that are within a given radius of a geo point."
It seems to have at least Java and JavaScript versions:
https://github.com/awslabs/dynamodb-geo
https://www.npmjs.com/package/dynamodb-geo
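For what it's worth, here is a minimal sketch of a radius query with the JavaScript version, roughly following the package README (the table name, region and 10-mile radius are just placeholders for this question):
// Sketch based on the dynamodb-geo README; table name, region and coordinates are placeholders.
const AWS = require('aws-sdk');
const ddbGeo = require('dynamodb-geo');

const ddb = new AWS.DynamoDB({ region: 'us-east-1' });
const config = new ddbGeo.GeoDataManagerConfiguration(ddb, 'Transactions');
const tableManager = new ddbGeo.GeoDataManager(config);

// Find all items within ~10 miles (about 16093 meters) of the input coordinates.
tableManager.queryRadius({
    RadiusInMeter: 16093,
    CenterPoint: {
        latitude: 40.7128,
        longitude: -74.0060
    }
})
.then(items => console.log(items))
.catch(err => console.error(err));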
Elasticsearch seems to support GeoHashing natively so it will probably have even better performance: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geohashgrid-aggregation.html
Personally I would recommend using Elasticsearch for searching because it's extremely powerful at that and searching with DynamoDB can be difficult.

You can't change the data type after the index field is created. I used this code in Kibana to declare the data type as "geo_point". Then I uploaded an item with the geopoint field and it worked.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-geo-distance-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-point.html
POST yourProjectName/_mappings/yourProjectType
{
  "properties": {
    "geopoint or whatever field you're storing the geo data": {
      "type": "geo_point"
    }
  }
}
POST _search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "summary": "something"
        }
      },
      "filter": {
        "geo_distance": {
          "distance": "12km",
          "geopoint": "40.054447974637476,-82.92002800852062"
        }
      }
    }
  }
}

MongoDB index-based text searches to match full string

While searching for entries in a MongoDB instance using its text indexing feature, I seem to receive results which contain any of the words in the input string. For example, if I search for 'google seo', it returns results for google seo, google, and seo. I only need it to return results which contain the entire string, or at least both of the words somewhere in the sentence, so results like 'Why should I google seo', 'What is google seo', 'What does google have to do with seo' etc. should be returned. Any combination of the following would be perfect.

I can currently mitigate the entire issue by just using a MongoDB regex, but that is way slower than the index search, as I have over 250m entries. As a test, index searches took 1.72s on average whilst the regex searches took over 27.23s. I want the speed of the index searches with even just half the accuracy of the regex searches; if the user can search quicker, it doesn't really matter if the results aren't the most accurate.

Programmatically creating regex searches that match all of the words from the input string wherever they appear (e.g. to return results which contain the words 'google' and 'seo' in the same sentence) also requires a lot of unnecessary code, and it isn't 100% accurate either. The current database schema is as follows:
{
  _id: 0000000000,
  search_string: string,
  difficulty: number,
  clicks: number,
  volume: number,
  keyword: string
}
The backend is a NodeJS server.
Any help is appreciated. Thanks!
Would combining the two approaches (text search and a regex) work?
No playground link since this needs a text index to demonstrate, but consider the following sample documents:
test> db.foo.find()
[
{ _id: 1, val: 'google seo' },
{ _id: 2, val: 'google ' },
{ _id: 3, val: 'seo random ' },
{ _id: 4, val: 'none' }
]
As described in the question and noted in the documentation, a search on 'google seo' returns all documents that match at least one of those terms (3 of the 4 in this sample data):
test> db.foo.find({$text:{$search:'google seo'}})
[
{ _id: 2, val: 'google ' },
{ _id: 1, val: 'google seo' },
{ _id: 3, val: 'seo random ' }
]
If we expand the query predicates to also include regexes on both of the terms via the $all operator, the results are narrowed down to just the single document:
test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})
[
{ _id: 1, val: 'google seo' }
]
It also works if the words are out of order as we'd expect:
test> db.foo.insert({_id:5, val:'seo out of order google string'})
{ acknowledged: true, insertedIds: { '0': 5 } }
test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})
[
{ _id: 1, val: 'google seo' },
{ _id: 5, val: 'seo out of order google string' }
]
The database first selects the candidate documents using the text index and then performs the final filtering via the regex prior to returning them to the client.
Alternatively, if you are using Atlas, you might look into the Atlas Search functionality. It seems like must or filter would satisfy this use-case as well.
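If that route is of interest, a compound query with two required text clauses might look something like the sketch below. This assumes an Atlas Search index (here the default one) already exists on the collection; the field and query values mirror the sample data above:
// Sketch only: requires an Atlas Search index on the "val" field.
db.foo.aggregate([
  {
    $search: {
      index: "default",
      compound: {
        must: [
          { text: { query: "google", path: "val" } },
          { text: { query: "seo", path: "val" } }
        ]
      }
    }
  }
])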
What worked for me
Running any sort of regex on possibly hundreds of thousands of data points will always be very time and resource intensive. Also, doing it natively with MongoDB means that data is not sent in chunks / asynchronously (at least as far as my knowledge extends).
Instead there are two approaches that can decrease either time, server resources or bandwidth usage.
Using the server to process the data before sending it over. This might seem obvious, but if you have the hardware overhead to perform such an operation on the server end, then it is much better and faster to run string comparisons server side and send the data back in chunks to be lazy loaded into your app.
For me this decreased average search times from over 29.3s to just below 2.23s, with a database of 250m entries, 80k results per search and around 10k-15k filtered results.
If you don't have the processing overhead and are willing to sacrifice bandwidth and user experience, then doing this on the client side isn't out of the question, especially considering just how capable modern hardware is. This provides a bit more flexibility, such that all the data can be shown with the relevant data first and the other 'irrelevant' data last. This does need to be well optimized and implemented with the supposition that your app will mostly be run on modern hardware, or preferably not on mobile devices.
These were the best solutions for me. There might be better native techniques, but over the span of a week I wasn't able to find any better (faster) solutions.
EDIT
I feel it's kind of necessary to elaborate on what kind of processing the data undergoes before it is sent out, and exactly how I do it. Currently I have a database of around 250m entries, each entry having the schema described in the question. The average query would usually be something like 'who is putin', 'android', 'iphone 13' etc. The database is made up of 12 collections, one for each 'major' keyword (what, why, should, how, when, which, who etc.), so the query is first stripped of those. So if the query was 'who is putin', it is converted to just 'is putin'. For cases where there is no keyword, all collections are checked. If there is a better way, let me know.
After that we send the query to the database and retrieve the results.
The query then undergoes another function which rids it of 'unnecessary' words, so words like is, if, a, be etc. are also removed, and it returns an array of the major words. A query like 'What is going on between Russia and Ukraine' gets converted to ['going', 'between', 'Russia', 'Ukraine']. As the results are received, we go over each of them to see if they include all the words from the array, and whichever do are returned to the client. It is a pretty basic operation, as we don't care about cases, spaces and so on; it simply uses the JS includes() method. For a query with precisely 2,344,123 results it takes around 2.12s cold to return the first results and just over 8.32s cold to finish. Running the same query again reduces times to around 0.84s warm and 1.98s warm to finish (cold for the first request and warm for subsequent requests).
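To illustrate the kind of post-filtering described above, here is a rough sketch of what that server-side step could look like. The stop-word list and function names are made up for the example and are not the exact code used:
// Hypothetical sketch of the server-side filtering step described above.
const STOP_WORDS = new Set(['what', 'why', 'should', 'how', 'when', 'which', 'who',
                            'is', 'if', 'a', 'be', 'on', 'and', 'the']);

// Reduce the raw query to its "major" words.
function majorWords(query) {
    return query
        .toLowerCase()
        .split(/\s+/)
        .filter(word => word.length > 0 && !STOP_WORDS.has(word));
}

// Keep only the results whose search_string contains every major word.
function filterResults(results, query) {
    const words = majorWords(query);
    return results.filter(doc =>
        words.every(word => doc.search_string.toLowerCase().includes(word))
    );
}

// Example: filterResults(resultsFromMongo, 'What is going on between Russia and Ukraine')
// keeps only documents containing 'going', 'between', 'russia' and 'ukraine'.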

How to convert existing Elasticsearch data from string to number

I am streaming AWS Cloudwatch logs (from a Node.js Lambda application) to an AWS Elasticsearch cluster, so that I can view metrics in Kibana.
Some of the data I was streaming was numeric, but was being logged as strings. I've updated the application code to log these as numeric values, however I can't use numeric visualizations in Kibana on those fields because the field type is now mixed -- i.e. in Kibana settings it says 13 fields are defined as several types (string, integer, etc) across the indices that match this pattern...
Is there a straightforward way to force ES / Kibana to treat that field as always numeric? Or convert all of the older logged data from string to number?
My searches have indicated I can do this with some kind of mutation using the ES API, but I can't track down what this API call would actually look like. Disclaimer: Elasticsearch noob.
There are two approaches here:
Convert all the data from strings to numeric values. Essentially, you'll have to reindex the whole data set (you can't just change the field type with one click), making sure that the strings are converted / typecast to numeric values along the way. The best way to reindex is to use Ingest Node Pipelines (see the sketch after this list).
Pros: Visualizations built on this data will be fast as the data is already in numeric format.
Cons: If the data set is huge, this conversion can take a long time.
Keep all data in string format as-is and use Scripted Fields in Kibana to convert the data to numeric format at runtime, e.g. whenever you visualize.
Pros: No need to set up a whole new pipeline to convert the data.
Cons: Visualizations on large timeframes might be too slow / heavy for your infrastructure.
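As a rough sketch of the first option, the reindex could be driven by an ingest pipeline with a convert processor, along these lines (the index names and the field name "myfield" are placeholders, and the destination index would also need a mapping that declares the field as numeric):
PUT _ingest/pipeline/strings-to-numbers
{
  "description": "Convert myfield from string to double while reindexing",
  "processors": [
    {
      "convert": {
        "field": "myfield",
        "type": "double",
        "ignore_missing": true
      }
    }
  ]
}

POST _reindex
{
  "source": { "index": "old-logs-index" },
  "dest": { "index": "new-logs-index", "pipeline": "strings-to-numbers" }
}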
Here is the scripted field I created, thanks to Abhishek's answer:
// Painless scripted field: read "myfield" (or its .keyword sub-field) and
// return it as a number even when the stored value is a string.
String key = 'myfield';
if (doc.containsKey(key + '.keyword')) {
  key += '.keyword';
  if (doc[key].size() != 0 && doc[key] != null) {
    if (doc[key].value instanceof String) {
      // Older documents stored the value as a string; parse it.
      return Double.parseDouble(doc[key].value);
    }
  }
} else if (doc.containsKey(key) && doc[key].size() != 0 && doc[key] != null) {
  // Newer documents already store the value as a number.
  return doc[key].value;
}

Data modeling with document database?

I am new to working with data.
I have a lot of time-based data: a data row for every 15 minutes. Should I compute and also store aggregated data for every 1 hour, 1 day, 1 month in the database?
If I do, would this schema be good?
{
  _id: "joe",
  name: "Joe Bookreader",
  "time min": [
    {
      time: "1",
      steps: "10"
    },
    {
      time: "2",
      steps: "4"
    }
  ],
  "time day": [
    {
      time: "1",
      steps: "30"
    },
    {
      time: "2",
      steps: "30"
    }
  ]
}
If you have any advice on how I can improve my data modeling knowledge with document databases, I would be really grateful.
For a minute, step away from the programmatic approach to the problem and think about the task at hand.
How are you going to use that data after you have stored it? When you use the data, is it important for you to know the exact number of steps for a particular user, or do you want to see the big picture based on particular sample points in time?
If you care about the per-user perspective, then your schema above will work. On the other hand, if you want to run global reports, such as how far users walked on average (or in total) during a certain time, then I would opt for a schema where your document is time (a point in time or a range in time), while user and steps are your properties.
Another important concept in databases is not to statically store data that can be calculated on the fly. As with any rule, there are some exceptions. One is cached values that are short-lived and will not have a major effect on your application if they are incorrect. Another is reports: you produced a report for the user based on current values and stored it; if the user wants fresh data, the user re-runs the report. (I am sure there are a few others.)
But in most cases, the risk that comes with serving stale/wrong data, and the wrong decisions made based on that data, will outweigh the performance benefit of avoiding extra calculations.
The reason I am mentioning this is that you are storing time min and time day. If time day can be calculated from time min, you should not store it in the database, but rather calculate it on the fly. You can write queries that will produce the actual result of time day without using any extra computational power on your application node; all computations will be done on the data node, much more efficiently than on a compute node and without network penalties.
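For example, if the 15-minute samples were stored one document per sample (a hypothetical samples collection with user, ts and steps fields, not the schema from the question), the daily totals could be computed on the fly with an aggregation along these lines:
// Sketch: one document per user per 15-minute slot, e.g.
// { user: "joe", ts: ISODate("2023-01-01T08:15:00Z"), steps: 10 }
db.samples.aggregate([
  { $match: { user: "joe" } },
  {
    $group: {
      _id: { $dateToString: { format: "%Y-%m-%d", date: "$ts" } },  // group by calendar day
      steps: { $sum: "$steps" }                                      // total steps that day
    }
  },
  { $sort: { _id: 1 } }
])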
I realize this post is a bit old, but I hope my answer will help someone.

Getting count of distinct groupings in RavenDB index

I have a number of documents within RavenDB of the form:
{
  "Id": "composite of namespace and video id",
  "Namespace": "youtube",
  "VideoId": "12345678901",
  "Start": "00:00:05"
}
I have a number of documents that reference different segments of the actual thing; in this case, I have multiple documents representing different timestamps within a video.
What I'd like to do is get a count of the distinct number of VideoId instances for a particular Namespace.
At first, I thought I could handle the distinct in the mapping:
from v in docs.Clips.Select(c => new { c.Namespace, c.VideoId }).Distinct()
But that doesn't work, as that query isn't run over the entire document set (so it's impossible to perform a Distinct call here).
I've thought about trying to handle this in the reduce part, but I can't think of an aggregate operation which would group this appropriately.
The shape of the map/reduce structure right now is:
new { Type = "providercount", Key = "youtube", Count = 1 }
As this is part of a multi-map which produces a summary.
How can I produce the count of distinct Namespace/VideoId values with this document structure?
One way to do it might be to group by Namespace and VideoId. That will get you distinct items. Then you would have to count all of those groups in a TransformResults section. However, I don't recommend doing this with a large number of items: transformation steps run as part of the query, so performance would be a big problem.
A better approach would be to keep an additional separate document per video (not per clip). For example:
videos/youtube/12345678901
{
  "Title": "whatever",
  "NumberOfClips": 3,
  "Clips": ["clipid1", "clipid2", "clipid3"]
}
I put a few properties in there that might be useful for other purposes, but the main point is that there is only one document per video.
Building these documents could be done in a couple of different ways:
You could write code in your application to add/update the Video documents at the same time you are writing Clip documents.
You could write a map/reduce index for the Clip documents and group by the NameSpace/VideoId, and then use the Indexed Properties Bundle to maintain the Video documents from the results.
Either way, once you have the set of Video documents, you can then do a simple map/reduce on those to get the count of distinct videos.
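As a rough sketch (assuming the per-video documents also carry the Namespace property, which the example document above does not show), that final map/reduce could look something like this:
// Map over the per-video documents
from video in docs.Videos
select new { Type = "providercount", Key = video.Namespace, Count = 1 }

// Reduce: one entry per video, so summing Count per namespace gives the distinct video count
from result in results
group result by new { result.Type, result.Key } into g
select new { Type = g.Key.Type, Key = g.Key.Key, Count = g.Sum(x => x.Count) }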

mongodb find followed by update semantics

This page shows an update reaching into a previously retrieved (find) document and querying a sub-element (array) to update it. I pretty much need to do the exact same thing. Code for the example:
> t.find()
{ "_id" : ObjectId("4b97e62bf1d8c7152c9ccb74"), "title" : "ABC",
"comments" : [ { "by" : "joe", "votes" : 3 }, { "by" : "jane", "votes" : 7 } ] }
> t.update( {'comments.by':'joe'}, {$inc:{'comments.$.votes':1}}, false, true )
What are the rules governing find-followed-by-update? I haven't noticed an explanation for this in the documentation. Does the same behaviour apply when using MongoDB via drivers? A link to the relevant semantics would be helpful. I am using the C++ driver.
edit: self answer
The two commands can be rolled into one (and this is one way of removing the ambiguity this question raises): the query part of an update can refer to an array sub-element, and the $ symbol will refer to it. I assume you can only reference one sub-element in the query part of an update operation. In my case the update operation looks as follows:
db.qrs.update ( { "_id" : ObjectId("4f1fa126adf93ab96cb6e848"), "urls.u_id" : 171 }, { "$inc" : { "urls.$.CC": 1} })
The _id correctly "primes" the right unique document, and the second query element "urls.u_id" : 171 ensures that the document in question has the right field. urls.$.CC then routes the $inc operation to the correct array entry.
Recommendation to any MongoDB dev or documentation writer
Do not show examples which have potential race conditions in them. Always avoid showing as multiple operations something that can be done atomically.
The rules are relatively straightforward. The results of the update may or may not be available to any subsequent reads, depending on a number of things (slaveOk true/false in combination with replica sets, update and find using different connections, write safety). You can guarantee the result to be available if you do a safe write (w >= 1) and perform the find on the same connection. Most drivers offer functionality for this (typically "requestStart" and "requestDone").
All that said, there's a much better solution available to you for this, namely findAndModify. This operation finds a document, updates it and returns either the old version of the document or the newly updated version. This command is available in the C++ driver. For a reference look here : http://www.mongodb.org/display/DOCS/findAndModify+Command
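For reference, a findAndModify version of the update from the question might look roughly like this in the shell (same query and $inc as above, with new: true so the command returns the updated document):
db.qrs.findAndModify({
    query: { "_id": ObjectId("4f1fa126adf93ab96cb6e848"), "urls.u_id": 171 },
    update: { "$inc": { "urls.$.CC": 1 } },
    new: true   // return the updated document instead of the original
})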
EDIT : Please note that the "find" in the example is only there to show the reader of the documentation what the structure/schema of the documents inside the collection is to place the subsequent "update" in context. The "update" operation is in no way affected by the "find" before it.