To perform geoqueries in DynamoDB, AWS provides a library (https://aws.amazon.com/blogs/mobile/geo-library-for-amazon-dynamodb-part-1-table-structure/). But to sort the results of a geoquery by distance, the entire result set must be read, correct? If a geoquery produces a large number of results, there is no way to paginate them (on the backend, not for the user) while sorting by distance, is there?
You are correct. To sort all of the data points by distance from some arbitrary location, you must read all the data from your DynamoDB table.
In DynamoDB, you can only sort results using a pre-computed value that has been stored in the table and is used as the sort key of the table or one of its indexes. If you only need to sort by distance from a single fixed location, you can do that with DynamoDB by storing the pre-computed distance as a sort key; for an arbitrary location, you cannot.
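For example, here is a minimal sketch of the fixed-location case (the table name, key names, and attributes are hypothetical, not part of the Geo library): the distance from one reference point is computed at write time and stored as the sort key, so a query returns items already ordered by it.

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table: partition key "region" (String), sort key "distance_m" (Number).
# "distance_m" is computed once, at write time, as the distance from ONE fixed
# reference point; DynamoDB then returns items sorted by it.
table = boto3.resource("dynamodb").Table("PlacesByDistance")

def nearest_to_fixed_point(region, limit=20):
    # Results come back in ascending sort-key order, i.e. nearest first.
    resp = table.query(
        KeyConditionExpression=Key("region").eq(region),
        Limit=limit,
    )
    return resp["Items"]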
Possible Workaround (with limitations)
TL;DR: it's not such a bad problem if you can get away with sorting only the items that are within X km of an arbitrary point.
This still involves sorting the data points in memory, but it makes the problem easier by producing incomplete results (by limiting the maximum range of the results).
To do this, you need the Geohash of your point P (from which you are measuring the distance of all other points). Suppose it is A234311. Then you need to pick what range of results is appropriate. Let's put some numbers on this to make it concrete. (I'm totally making these numbers up because the actual numbers are irrelevant for understanding the concepts.)
A - represents a 6400km by 6400km area
2 - represents a 3200km by 3200km area within A
3 - represents a 1600km by 1600km area within A2
4 - represents a 800km by 800km area within A23
3 - represents a 400km by 400km area within A234
1 - represents a 200km by 200km area within A2343
1 - represents a 100km by 100km area within A23431
Graphically, it might look like this:
View of A                    View of A23

|----------|-----------|     |----------|-----------|
|          | A21 | A22 |     |          |           |
|    A1    |-----|-----|     |   A231   |   A232    |
|          | A23 | A24 |     |          |           |
|----------|-----------|     |----------|-----------|
|          |           |     |          |A2341|A2342|
|    A3    |    A4     |     |   A233   |-----|-----|
|          |           |     |          |A2343|A2344|
|----------|-----------|     |----------|-----------|   ... and so on.
In this case, our point P is in A234311. Suppose also that we want to get the sorted points within 400km. A2343 is 400km by 400km, so we need to load the results from A2343 and all eight of its neighbors (A2341, A2342, A2344, A2334, A2332, A4112, A4121, A4122). Once we've loaded only those into memory, we calculate the distances, sort them, and discard any results that are more than 400km away.
(You could keep the results that are more than 400km away as long as the users/clients know that beyond 400km, the data could be incomplete.)
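A minimal sketch of this workaround in Python (the cell names above, the load_cell function, and the item attribute names are assumptions; the actual cells would come from the Geo library or a geohash package):

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def sorted_within_range(point, cells, load_cell, max_km=400):
    # point: (lat, lon) of P; cells: P's cell plus its 8 neighbours;
    # load_cell: a function that queries DynamoDB for every item in one cell.
    items = []
    for cell in cells:
        items.extend(load_cell(cell))
    for item in items:
        item["distance_km"] = haversine_km(point[0], point[1], item["lat"], item["lon"])
    items.sort(key=lambda i: i["distance_km"])
    # Drop anything beyond the chosen range; beyond it the data is incomplete anyway.
    return [i for i in items if i["distance_km"] <= max_km]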
The hashing method that the DynamoDB Geo library uses is very similar to a Z-order curve. You may find it helpful to familiarize yourself with that method, as well as with Part 1 and Part 2 of the AWS Database Blog series on Z-Order Indexing for Multifaceted Queries in DynamoDB.
Not exactly. When querying by location you query by a fixed partition key value and by sort key, so you can limit the query's result set and also apply a little filtering.
I have been racking my brain while designing a DynamoDB geohash proximity locator service. In this example, customer_A wants to find all service providers (provider_X) in their area. All customers and providers have a 'g8' key that stores their precise geohash location (to 8 characters).
The accepted way to accomplish this search is to generate a secondary index from the main table with a less accurate geohash 'g4', which gives a broader area for the main query key. I am applying key overloading and composite key structures for a single-table design. The goal of this design is to return all the required data in a single query; secondary indexes can duplicate data by design (storage is cheap, but CPU and bandwidth are not).
GSI1PK     GSI1SK     providerId       Projected keys and attributes
---------------------------------------------------------------------
g4_9q5c    provider   pr_providerId1   name rating
g4_9q5c    provider   pr_providerId2   name rating
g4_9q5h    provider   pr_providerId3   name rating
Scenario 1: customer_A.g8_9q5cfmtk. You issue a query where GSI1PK=g4_9q5c, and a list of two providers is returned, not the three I desire.
But using geoHash.neighbor() will return the eight surrounding neighbors, like 9q5h (see reference below). That's great, because there is a provider in 9q5h, but it means I have to run nine queries, one on the center and eight on the neighbors, or run 1 to N of them until I have the minimum number of results I require.
But which direction do you query second: NW, SW, E? That would require another level of hinting about which neighbor has more results, which you cannot know up front unless you run a pre-query for weighted results. But then you run the risk of only returning favorable neighbors, since there could be new providers in previously unfavored neighbors. You could apply some ML and randomized queries into neighbors to check current counts.
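A rough sketch of that nine-query approach with boto3 (the table name, the GSI name 'GSI1', and the neighbors() helper standing in for geoHash.neighbor() are assumptions, not part of the question's schema):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Services")  # hypothetical table name

def providers_near(customer_g8, neighbors, min_results=3):
    # Trim the customer's precise g8 hash down to the 4-character prefix used by the GSI.
    center_g4 = customer_g8[:4]
    found = []
    # Query the center cell first, then each of its 8 neighbors until we have enough.
    for cell in [center_g4] + neighbors(center_g4):
        resp = table.query(
            IndexName="GSI1",  # assumed GSI name
            KeyConditionExpression=Key("GSI1PK").eq(f"g4_{cell}")
                                   & Key("GSI1SK").eq("provider"),
        )
        found.extend(resp["Items"])
        if len(found) >= min_results:
            break
    return found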
Before the above approach I tried this design.
GSI1PK   GSI1SK        providerId     Projected keys and attributes
---------------------------------------------------------------------
loc      g8_9q5cfmtk   pr_provider1
loc      g8_9q5cfjgq   pr_provider2
loc      g8_9q5fe954   pr_provider3
Scenario 2: customer_A.g8_9q5cfmtk. You issue a query where GSI1PK=loc and GSI1SK is between g8_9q5ca and g8_9q5fz, and a list of three providers is returned, but a ton of data is pulled and discarded.
To achieve the above query, the BETWEEN X AND Y sort criteria is composed from 9q5c.neighbors().sorted() = 9q59, 9q5c, 9q5d, 9q5e, 9q5f, 9q5g, 9qh1, 9qh4, 9qh5. So we can just use X=9q59 and Y=9qh5, but there are over 50 (I really didn't count after 50) quadrants that match such a string-range BETWEEN.
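A sketch of that Scenario 2 query in boto3 (again assuming a GSI named 'GSI1' and a neighbors() helper; the over-fetching noted above is inherent to the range):

from boto3.dynamodb.conditions import Key

def providers_by_range(table, neighbors, center_g4="9q5c"):
    # Sort the 3x3 block of cells and use its first/last members as the BETWEEN bounds.
    cells = sorted(neighbors(center_g4) + [center_g4])
    low, high = f"g8_{cells[0]}", f"g8_{cells[-1]}zzzz"  # e.g. g8_9q59 .. g8_9qh5zzzz
    resp = table.query(
        IndexName="GSI1",  # assumed GSI name
        KeyConditionExpression=Key("GSI1PK").eq("loc")
                               & Key("GSI1SK").between(low, high),
    )
    # One query, but it also sweeps up every unrelated quadrant whose hash sorts
    # between the bounds, which is exactly the waste described above.
    return resp["Items"]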
Regarding the hash/size table above, I would recommend using this reference: https://www.movable-type.co.uk/scripts/geohash.html
Geohash length   Cell width × Cell height
1                ≤ 5,000km  ×  5,000km
2                ≤ 1,250km  ×    625km
3                ≤   156km  ×    156km
4                ≤  39.1km  ×   19.5km
5                ≤  4.89km  ×   4.89km
...
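If it helps, here is a tiny helper built from the values in that table (the cutoff rule, picking the most precise length whose cell still covers the radius, is just one reasonable convention, not something from the linked page):

# Approximate smaller cell dimension (km) per geohash length, taken from the table above.
CELL_KM = {1: 5000, 2: 625, 3: 156, 4: 19.5, 5: 4.89}

def precision_for_radius(radius_km):
    # Pick the longest (most precise) geohash whose cell is still at least as large
    # as the radius, so a cell plus its 8 neighbors comfortably covers the search circle.
    best = 1
    for length, cell_km in sorted(CELL_KM.items()):
        if cell_km >= radius_km:
            best = length
    return best

print(precision_for_radius(10))  # -> 4 (19.5km x 39.1km cells)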
Related
We have a completely serverless architecture and have been using DynamoDB almost since it was released, but I am struggling to see how to deal with tabulating global numbers at a large scale. Say we have users who choose to do either A or B. We want to keep track of how many users do each, and these events could happen at a high rate. According to DynamoDB best practices, you are not supposed to write continually to a single row. What is the best way to handle this without using another service like CouchDB or ElastiCache?
You could bucket your users by the first letter of their usernames (or something similar) as the partition key, with either A or B as the sort key, and a regular attribute holding the count.
For example:
PARTITION KEY | SORT KEY | COUNT
--------------------------------
a | A | 5
a | B | 7
b | B | 15
c | A | 1
c | B | 3
The advantage is that you can reduce the risk of hot partitions by spreading your writes across multiple partitions.
Of course, you're trading hot writes for more expensive reads, since now you'll have to scan + filter(A) to get the total count that chose A, and another scan + filter(B) for the total count of B. But if you're writing a bunch and only reading on rare occasions, this may be ok.
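A hedged sketch of that pattern with boto3 (the table name, key names, and first-letter bucketing all come from the example above and are otherwise assumptions):

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("ChoiceCounts")  # hypothetical table name

def record_choice(username, choice):
    # Writes are cheap: bucket by first letter to spread them across partitions.
    table.update_item(
        Key={"bucket": username[0].lower(), "choice": choice},
        UpdateExpression="ADD #c :one",
        ExpressionAttributeNames={"#c": "count"},
        ExpressionAttributeValues={":one": 1},
    )

def total_for(choice):
    # Reads are the expensive side of the trade-off: scan + filter, then sum.
    total, kwargs = 0, {"FilterExpression": Attr("choice").eq(choice)}
    while True:
        resp = table.scan(**kwargs)
        total += sum(item["count"] for item in resp["Items"])
        if "LastEvaluatedKey" not in resp:
            return total
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]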
I am new to RedShift and just experimenting at this stage to help with table design.
We have a very simple table with about 6 million rows and 2 integer fields.
Both integer fields are in the sort key but the plan has a warning - "very selective query filter".
The STL_Alert_Event_Log entry is:
'Very selective query filter:ratio=rows(61)/rows_pre_user_filter(524170)=0.000116'
The query we are running is:
select count(*)
from LargeNumberofRowswithUniKey r
where r.benchmarkid = 291891 and universeid = 300901
Our Table DDL is:
CREATE TABLE public.LargeNumberofRowswithUniKey
(
    benchmarkid INTEGER NOT NULL DISTKEY,
    UniverseID  INTEGER NOT NULL
)
SORTKEY
(
    benchmarkid, UniverseID
);
We have also run the following commands on the table:
Vacuum full public.LargeNumberofRowswithUniKey;
Analyze public.LargeNumberofRowswithUniKey;
The screenshot of the plan is here: [Query Plan Image][1]
My expectation was that the compound sort key on benchmarkid and UniverseID, and the fact that both are part of the filter predicate, would ensure that the design was optimal for the sample query. This does not seem to be the case, hence the red warning symbol in the attached image. Can anyone shed light on this?
Thanks
George
Update 2017/09/07
I have some more information that may help:
If I run a much simpler query which just filters on the first column of the sort key:
select r.benchmarkid
from LargeNumberofRowswithUniKey r
where r.benchmarkid = 291891
This results in 524,170 rows being scanned, according to the actual query plan from the console. When I look at the blocks using STV_BLOCKLIST, the relevant blocks that might be required to satisfy my query are:
|slice|col|tbl   |blocknum|num_values|minvalue|maxvalue|
|    1|  0|346457|       4|    262085|  291881|  383881|
|    3|  0|346457|       4|    262085|  291883|  344174|
|    0|  0|346457|       5|    262085|  291891|  344122|
So shouldn't there be 786,255 rows scanned (3 x 262,085) instead of 524,170 (2 x 262,085) as listed in the plan?
The "very selective filter" warning is returned when the rows selected vs rows scanned ratio is less than 0.05 i.e. a relatively large number of rows are scanned compared to the number of rows actually returned. This can be caused by having a large number of unsorted rows in a table, which can be resolved by running a vacuum. However, as you're already doing that I think this is happening because your query is actually very selective (you're selecting a single combination of benchmarkid and universeid) and so you can probably ignore this warning.
Side observation: if you are always selecting values by using both benchmarkid and UniverseID, you should probably use DISTSTYLE EVEN instead of a DISTKEY on benchmarkid.
The reason for this is that a benchmarkid DISTKEY would distribute the data between slices based on benchmarkid. All the values for a given benchmarkid would be on the same slice. If your query always provides a benchmarkid in the query, then the query only utilizes one slice.
On the other hand, with DISTSTYLE EVEN every slice can participate in the query, making it more efficient (for queries with WHERE benchmarkid = xxx).
A general rule of thumb is:
Use DISTKEY for fields commonly used in JOIN or GROUP BY
Use SORTKEY for fields commonly used in WHERE
I have a 10,000 observation dataset with a list of location information looking like this:
ADDRESS | CITY | STATE | ZIP |LATITUDE |LONGITUDE
1189 Beall Ave | Wooster | OH | 44691 | 40.8110501 |-81.93361870000001
580 West 113th Street | New York City | NY | 10025 | 40.8059768 | -73.96506139999997
268 West Putnam Avenue | Greenwich | CT | 06830 | 40.81776801 |-73.96324589997
1 University Drive | Orange | CA | 92866 | 40.843766801 |-73.9447589997
200 South Pointe Drive | Miami Beach | FL | 33139 | 40.1234801 |-73.966427997
I need to find the overlapping locations within a 5-mile and a 10-mile radius. I heard that there is a function called geodist which may allow me to do that, although I have never used it. The problem is that for geodist to work I may need all the combinations of latitudes and longitudes to be side by side, which may make the file really large and hard to use. I also do not know how I would get the lat/longs for every combination side by side.
Does anyone know of a way I could get the final output that I am looking for?
Here is a broad outline of one possible approach to this problem:
Allocate each address into a latitude and longitude 'grid' by rounding the co-ordinates to the nearest 0.01 degrees or something like that.
Within each cell, number all the addresses 1 to n so that each has a unique id.
Write a datastep taking your address dataset as input via a set statement, and also load it into a hash object. Your dataset is fairly small, so you should have no problems fitting the relevant bits in memory.
For each address, calculate distances only to other addresses in the same cell, or other cells within a certain radius, i.e.
Decide which cell to look up
Iterate through all the addresses in that cell using the unique id you created earlier, looking up the co-ordinates of each from the hash object
Use geodist to calculate the distance for each one and output a record if it's a keeper.
This is a bit more work to program, but it is much more efficient than an O(n^2) brute force search. I once used a similar algorithm with a dataset of 1.8m UK postcodes and about 60m points of co-ordinate data.
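The answer above is written for SAS (geodist plus a hash object). Purely as an illustration of the same gridding idea, here is a rough Python sketch, with a hand-rolled haversine standing in for geodist and a cell size that is an assumption (chosen so a 10-mile radius never spans more than one neighboring cell at US latitudes):

import math
from collections import defaultdict

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles.
    r = 3958.8
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def pairs_within(addresses, radius_miles=10, cell_deg=0.25):
    # addresses: list of dicts with 'LATITUDE' and 'LONGITUDE' keys.
    # 1. Allocate each address to a lat/lon grid cell (~0.25 degrees here).
    cell = lambda a: (round(a["LATITUDE"] / cell_deg), round(a["LONGITUDE"] / cell_deg))
    grid = defaultdict(list)
    for i, a in enumerate(addresses):
        grid[cell(a)].append(i)
    # 2. For each address, compare it only against addresses in its own cell
    #    or one of the 8 surrounding cells, instead of against all n-1 others.
    for i, a in enumerate(addresses):
        ci, cj = cell(a)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for j in grid.get((ci + di, cj + dj), []):
                    if j <= i:
                        continue  # emit each unordered pair only once
                    b = addresses[j]
                    d = haversine_miles(a["LATITUDE"], a["LONGITUDE"],
                                        b["LATITUDE"], b["LONGITUDE"])
                    if d <= radius_miles:
                        yield i, j, d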
I've read lots of DynamoDB docs on designing partition keys and sort keys, but I think I must be missing something fundamental.
If you have a bad partition key design, what happens when the data for a SINGLE partition key value exceeds 10GB?
The 'Understand Partition Behaviour' section states:
"A single partition can hold approximately 10 GB of data"
How can it partition a single partition key?
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
The docs also say that a table with a local secondary index is limited to 10GB of data per item collection, after which you start getting errors.
"The maximum size of any item collection is 10 GB. This limit does not apply to tables without local secondary indexes; only tables that have one or more local secondary indexes are affected."
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LSI.html#LSI.ItemCollections
That I can understand. So does it have some other magic for partitioning the data for a single partition key if it exceeds 10GB? Or does it just keep growing that partition? And what are the implications of that for your key design?
The background to the question is that I've seen lots of examples of using something like a TenantId as a partition key in a multi-tenant environment. But that seems limiting if a specific TenantId could have more than 10GB of data.
I must be missing something?
TL;DR - items can be split even if they have the same partition key value, by including the sort (range) key value in the partitioning function.
The long version:
This is a very good question, and it is addressed in the documentation here and here. As the documentation states, items in a DynamoDB table are partitioned based on their partition key value (which used to be called hash key) into one or multiple partitions, using a hashing function. The number of partitions is derived based on the maximum desired total throughput, as well as the distribution of items in the key space. In other words, if the partition key is chosen such that it distributes items uniformly across the partition key space, the partitions end up having approximately the same number of items each. This number of items in each partition is approximately equal to the total number of items in the table divided by the number of partitions.
The documentation also states that each partition is limited to about 10GB of space. And that once the sum of the sizes of all items stored in any partition grows beyond 10GB, DynamoDB will start a background process that will automatically and transparently split such partitions in half - resulting in two new partitions. Once again, if the items are distributed uniformly, this is great because each new sub-partition will end up holding roughly half the items in the original partition.
An important aspect of splitting is that each of the split partitions will have half the throughput that would have been available to the original partition.
So far we've covered the happy case.
On the flip side, it is possible to have one, or a few, partition key values that correspond to a very large number of items. This can usually happen if the table schema uses a sort key and many items share the same partition key value. In such a case, it is possible that a single partition key value could be responsible for items that together take up more than 10 GB, and this will result in a split. In this case DynamoDB will still create two new partitions, but instead of using only the partition key to decide which sub-partition an item should be stored in, it will also use the sort key.
Example
Without loss of generality and to make things easier to reason about, imagine that there is a table where partition keys are letters (A-Z), and numbers are used as sort keys.
Imagine that the table has about 9 partitions, so letters A, B, C would be stored in partition 1, letters D, E, F would be in partition 2, etc.
In the diagram below, the partition boundaries are marked h(A0), h(D0), etc. to show that, for instance, the items stored in the first partition are the items whose partition key hashes to a value between h(A0) and h(D0) - the 0 is intentional, and comes in handy next.
[ h(A0) ]--------[ h(D0) ]---------[ h(G0) ]-------[ h(J0) ]-------[ h(M0) ]- ..
|  A   B   C    |  D   E   F      |  G   I        |  J   K   L    |
|  1   1   1    |  1   1   1      |  1   1        |  1   1   1    |
|  2   2   2    |  2   2   2      |  2            |  2            |
|  3       3    |  3       3      |  3            |               |
|      ..       |  ..      ..     |      ..       |      ..       |
|               | 100     500     |               |               |
+---------------+-----------------+---------------+---------------+-- ..
Notice that for most partition key values there are between 1 and 3 items in the table, but two partition key values, D and F, are not looking too good: D has 100 items while F has 500 items.
If items with a partition key value of F keep getting added, eventually the partition [h(D0)-h(G0)) will split. To make it possible to split the items that have the same hash key, the range key values will have to be used, so we'll end up with the following situation:
..[ h(D0) ]------------/ [ h(F500) ] /------------[ h(G0) ]- ..
  |  D   E   F         |   F                      |
  |  1   1   1         |  501                     |
  |  2   2   2         |  502                     |
  |  3       3         |  503                     |
  |  ..      ..        |   ..                     |
  | 100     500        |  1000                    |
..--+-------------------+--------------------------+--- ..
The original partition [h(D0)-h(G0)) was split into [h(D0)-h(F500)) and [h(F500)-h(G0))
I hope this helps to visualize that items are generally mapped to partitions based on a hash value obtained by applying a hashing function to their partition key value, but if need be, the value being hashed can include the partition key + a sort key value as well.
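As a toy illustration only (this is not DynamoDB's real internal algorithm, just a model consistent with the h(A0)/h(F500) notation above, where the "hash" is treated as order-preserving): route each item by its composite key against the partition boundaries, and notice how adding the h(F500) boundary splits the F item collection in two.

import bisect

def h(partition_key, sort_key=0):
    # Toy "hash": the composite key itself, so it is order-preserving,
    # matching the h(A0), h(F500) labels in the diagrams above.
    return (partition_key, sort_key)

# Partition boundaries before the split ...
boundaries = [h("A"), h("D"), h("G"), h("J"), h("M")]
# ... and after the item collection for F grows too large and splits at F500.
boundaries_after_split = [h("A"), h("D"), h("F", 500), h("G"), h("J"), h("M")]

def partition_for(pk, sk, boundaries):
    # An item lives in the right-most partition whose lower boundary is <= its key.
    return bisect.bisect_right(boundaries, h(pk, sk)) - 1

print(partition_for("F", 10,  boundaries))              # -> 1  (old partition D..G)
print(partition_for("F", 10,  boundaries_after_split))  # -> 1  (left half,  D..F500)
print(partition_for("F", 501, boundaries_after_split))  # -> 2  (right half, F500..G)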
While creating train, test and cross-validation samples in Python, I see the default method as:
1. Reading the dataset, after skipping headers
2. Creating the train, test and cross-validation samples
import csv

with open('C:/Users/Train/Trainl.csv', 'r') as f1:
    next(f1)
    reader = csv.reader(f1, delimiter=',')
    input_set = []
    for row in reader:
        input_set.append(row)

import numpy as np
from numpy import genfromtxt
from sklearn import cross_validation

train, intermediate_set = cross_validation.train_test_split(input_set, train_size=0.6, test_size=0.4)
cv, test = cross_validation.train_test_split(intermediate_set, train_size=0.5, test_size=0.5)
My problem, though, is that I have a field, say "A", in the CSV file that I read into the numpy array, and all sampling should respect this field. That is, all entries with the same value of "A" should go into one sample.
Line # | A | B | C | D
     1 | 1 |
     2 | 1 |
     3 | 1 |
     4 | 1 |
     5 | 2 |
     6 | 2 |
     7 | 2 |
Required: lines 1, 2, 3, 4 should all go into one sample, and lines 5, 6, 7 should all go into one sample.
The value of column A is a unique id corresponding to one single entity (it can be seen as cross-sectional data points on one SINGLE user, so it MUST go into one unique sample of train, test, or cv), and there are many such entities, so grouping by entity id is required.
Columns B, C, D may have any values, and grouping does not need to be preserved on them. (Bonus: can I group the sampling on multiple fields?)
What I tried :
A. Finding all unique values of A and denoting these as my sample, I distribute the sample among train, intermediate (cv) and test, then put the rest of the rows for each value of "A" into the corresponding file.
That is, if train has the entry for "3", test for "2" and cv for "1", then all rows with A equal to 3 go into train, all with 2 go into test and all with 1 go into cv.
Of course, this approach is not scalable.
And I suspect it may introduce bias into the datasets, since the number of 1s in column A, the number of 2s, etc. is not equal, meaning this approach will not work!
B. I also tried numpy.random.shuffle, or numpy.random.permutation as per the thread here - Numpy: How to split/partition a dataset (array) into training and test datasets for, e.g., cross validation? , but it did not meet my requirement.
C. A third option, of course, is writing a custom function that does this grouping and then balances the training, test and cv datasets based on the number of data points in each group. But I am just wondering if there's already an efficient way to implement this?
Note my data set is huge, so ideally I would like to have a deterministic way to partition my datasets, without having multiple eye-ball-scans to be sure that the partition is correct.
EDIT Part 2:
Since I did not find anything that fit my sampling criteria, I actually wrote a module to sample with grouping constraints. Here is the GitHub code for it. The code was not written with very large data in mind, so it's not very efficient. Should you FORK this code, please point out how I can improve the run-time.
https://github.com/ekta1007/Sampling-techniques/blob/master/sample_expedia.py
By forcing such constraints you will introduce bias into your procedure either way. So the approach based on partitioning the "users" and then collecting their respective "measurements" does not seem bad. And it will scale just fine; this is an O(n) method, and the only reason for it not scaling up would be a bad implementation, not a bad method.
The reason there is no such functionality in existing libraries (like sklearn) is that it looks highly artificial and runs counter to the idea behind machine learning models. If these rows are somehow one entity, then they should not be treated as separate data points. If you need this separate representation, then requiring such a division, where a particular entity cannot be partially in the test set and partially in the training set, will surely bias the whole model.
To sum up: you should really analyze deeply whether your approach is reasonable from the machine learning point of view. If you are sure about it, I think the only possibility is to write the segmentation yourself, because even though I have used many ML libraries in the past, I've never seen such functionality.
In fact, I am not sure whether the problem of segmenting a set of N numbers (the sizes of the entities) into K (= 3) subsets of given sum proportions, with uniform distribution when treated as a random process, is not an NP problem in itself. If you cannot guarantee a uniform distribution, then your datasets cannot be used as a statistically correct way of training/testing/validating your model. Even if it has a reasonable polynomial solution, it can still scale up badly (much worse than linear methods). This doubt applies if your constraints are "strict"; if they are "weak" you can always take a "generate and reject" approach, which should have amortized linear complexity.
I was also facing a similar kind of issue; though my coding is not too good, I came up with the solution given below:
Created a new data frame that only contains the Unique_Id column of df and removed duplicates.
import pandas as pd
from sklearn.model_selection import train_test_split

new = df[["Unique_Id"]].copy()
New_DF = new.drop_duplicates()
Created training and test sets on the basis of New_DF.
train, test = train_test_split(New_DF, test_size=0.2)
And then merged those training and test sets with the original df.
df_Test = pd.merge(df, test, how='inner', on='Unique_Id')
df_Train = pd.merge(df, train, how='inner', on='Unique_Id')
Similarly, we can create a sample for the validation part too.
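For instance, a minimal sketch of that validation step under the same assumptions as the snippet above (splitting the unique ids held out for training once more, then joining back to df):

# Hypothetical continuation: carve a validation set out of the training ids,
# so every Unique_Id still lands in exactly one of train/validation/test.
train_ids, val_ids = train_test_split(train, test_size=0.25)

df_Val = pd.merge(df, val_ids, how='inner', on='Unique_Id')
df_Train = pd.merge(df, train_ids, how='inner', on='Unique_Id')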
Cheers.