How do I keep a running count in DynamoDB without a hot table row?

We have a completely serverless architecture and have been using DynamoDB almost since it was released, but I am struggling to see how to tabulate global numbers at large scale. Say we have users who choose to do either A or B. We want to keep track of how many users do each, and these choices can happen at a high rate. According to DynamoDB best practices, you are not supposed to write continually to a single row. What is the best way to handle this without bringing in another service like CouchDB or ElastiCache?

You could bucket your users by the first letter of their usernames (or something similar) as the partition key, with either A or B as the sort key and a regular attribute holding the count.
For example:
PARTITION KEY | SORT KEY | COUNT
--------------------------------
a             | A        | 5
a             | B        | 7
b             | B        | 15
c             | A        | 1
c             | B        | 3
The advantage is that you can reduce the risk of hot partitions by spreading your writes across multiple partitions.
Of course, you're trading hot writes for more expensive reads, since now you'll have to scan + filter(A) to get the total count for A, and another scan + filter(B) for the total count for B. But if you're writing a lot and only reading on rare occasions, this may be acceptable.
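As a rough illustration of that layout, here is a minimal boto3 (Python) sketch, assuming a hypothetical table named choice_counts with partition key "bucket" and sort key "choice"; the names are mine, not from the question. Writes go to the bucket derived from the username, and the read path sums the buckets with a scan + filter:

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("choice_counts")  # hypothetical table name

def record_choice(username, choice):
    """Increment the counter in the bucket derived from the username."""
    bucket = username[0].lower()  # spreads writes across partitions
    table.update_item(
        Key={"bucket": bucket, "choice": choice},
        UpdateExpression="ADD #c :inc",
        ExpressionAttributeNames={"#c": "count"},  # COUNT is a DynamoDB reserved word
        ExpressionAttributeValues={":inc": 1},
    )

def total_for(choice):
    """Sum the per-bucket counters for one choice (scan + filter)."""
    total, kwargs = 0, {"FilterExpression": Attr("choice").eq(choice)}
    while True:
        page = table.scan(**kwargs)
        total += sum(int(item["count"]) for item in page["Items"])
        if "LastEvaluatedKey" not in page:
            return total
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

A random suffix (e.g. 0-9) spreads writes just as well as the first letter if your usernames are skewed.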

Related

Multiple object lists: how to merge, removing duplicates

The problem is simple and I found several answers on how to proceed, but I need more specific help because of the size of the problem. Here is the situation:
I have several (let's say 20) collections of C++ objects (all of the same type)
Each collection contains hundreds of millions of entries
The same entry can be present in more than one of the 20 collections
Each collection is made up of a few thousand files, each around 4GB. Each collection is around 50TB and the total size of all collections is around 1PB
CPU resources available: a few thousand nodes (each with 2GB RAM and a reasonably new CPU). All of them can run asynchronously, accessing the files of the collections one by one
Disk resources available: I cannot save a full second copy of all collections (I don't have another PB of disk available), but I can reduce the size of each entry, keeping only the relevant information. The final reduced size of all collections would be less than 100TB, and that's OK
What I would like to do is merge the 20 collections into a single collection with all the entries, removing all the duplicates. The total number of entries is around 5 billion, and a few percent of them (say around 3-5%) are duplicates.
Another important point is that the total size (all 20 original collections) is more than 1PB, so processing the full set of collections is a heavy task.
Finally, once the merging is done (i.e. when all the duplicates have been removed), the resulting collection has to be processed several times, so the output of the merge will be used as input to further processing steps.
Here is an example:
Collection1
------------------------------------------
        | n1 | n2 | n3 | value1...
------------------------------------------
entry0: | 23 | 11 | 34 | ....
entry1: | 43 | 12 | 24 | ....
entry2: | 71 | 51 | 91 | ....
...
Collection2
------------------------------------------
        | n1 | n2 | n3 | value1...
------------------------------------------
entry0: | 71 | 51 | 91 | ....
entry1: | 73 | 81 | 23 | ....
entry2: | 53 | 22 | 84 | ....
...
As you can see, three integers (n1, n2 and n3) are used to distinguish each entry, and entry2 in Collection1 has the same three integers as entry0 in Collection2; the latter is a duplicate of the former. Merging these two collections would give a single collection with 5 entries (the duplicate having been removed).
The collections are not sorted, and each collection is made up of thousands of files (typical file size 4GB, and a single collection is tens of TB).
Any suggestions on the best approach?
Thanks for helping
I hope your objects can be ordered? o1 <= o2 <= ... <= oN
Load one collection in the memory and sort it.
Save it to the disk.
Get next collection.
Sort it.
Merge the two collections on the disk and delete the first one.
Get next collection ...
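A minimal sketch of that sort-then-merge idea in Python, assuming each entry can be reduced to its (n1, n2, n3) key plus a payload; the entry shape and helper names are illustrative, not from the post:

import heapq

def entry_key(e):
    """The three integers that identify an entry."""
    return (e["n1"], e["n2"], e["n3"])

def sort_run(entries):
    """Sort one collection (or one chunk that fits in memory) by its key."""
    return sorted(entries, key=entry_key)

def merge_dedup(sorted_runs):
    """Merge already-sorted runs, keeping only the first copy of each key."""
    last = None
    for entry in heapq.merge(*sorted_runs, key=entry_key):
        if entry_key(entry) != last:
            last = entry_key(entry)
            yield entry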
Given the speed of your network and the number of available nodes, here is one way you could proceed.
You have a total of about 5G entries, and 20 collections. So, on average, 250M entries per collection. Duplicate entries between collections are on the order of 3-5% (7-12M entries per collection). Now, because you have 20 collections scattered over thousands of nodes, each collection is most likely spread across multiple nodes.
Here are the general steps of what you could do.
For each of your collections, create a database on a chosen node, where you will store all the entry IDs of the collection. That database will be on the order of a few GBs.
On each node, run a process that scans all files at the node and adds the entry IDs to the corresponding collection database.
On a single node, run a process that reads from all collection databases and finds duplicates. When a duplicate is found in two collections, remove the entry ID from one of the two.
Run a process on each node to delete, from the files at the node, all entries whose IDs are not in their collection database.
In the end, all duplicates have been eliminated, and you also get 20 databases with the IDs of all the entries in each collection.
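A minimal sketch of the duplicate-detection step (the single-node pass over the collection databases), assuming each per-collection database can be read back as an iterable of (n1, n2, n3) ID tuples; names are illustrative:

def find_ids_to_drop(collection_id_sets):
    """collection_id_sets: one iterable of ID tuples per collection.
    Returns, per collection, the set of IDs to remove so that each ID
    survives in exactly one collection."""
    seen = set()
    to_drop = []
    for ids in collection_id_sets:
        drop = set()
        for entry_id in ids:
            if entry_id in seen:
                drop.add(entry_id)  # already kept in an earlier collection
            else:
                seen.add(entry_id)
        to_drop.append(drop)
    return to_drop

With roughly 5 billion IDs, the "seen" set won't fit in a few GB of RAM in this naive form, so in practice this pass would be chunked by key range; the per-chunk logic stays the same.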

Sorting query by distance requires reading entire data set?

To perform geoqueries in DynamoDB, there are libraries in AWS (https://aws.amazon.com/blogs/mobile/geo-library-for-amazon-dynamodb-part-1-table-structure/). But to sort the results of a geoquery by distance, the entire dataset must be read, correct? If a geoquery produces a large number of results, there is no way to paginate that (on the backend, not to the user) if you're sorting by distance, is there?
You are correct. To sort all of the data points by distance from some arbitrary location, you must read all the data from your DynamoDB table.
In DynamoDB, you can only sort results using a pre-computed value that has been stored in the table and is used as the sort key of the table or one of its indexes. If you need to sort by distance from a fixed location, you can do this with DynamoDB by pre-computing that distance and storing it as a sort key.
Possible Workaround (with limitations)
TL;DR: it's not such a bad problem if you can get away with only sorting the items that are within X km of an arbitrary point.
This still involves sorting the data points in memory, but it makes the problem easier by producing incomplete results (limiting the maximum range of the results).
To do this, you need the Geohash of your point P (from which you are measuring the distance of all other points). Suppose it is A234311. Then you need to pick what range of results is appropriate. Let's put some numbers on this to make it concrete. (I'm totally making these numbers up because the actual numbers are irrelevant for understanding the concepts.)
A - represents a 6400km by 6400km area
2 - represents a 3200km by 3200km area within A
3 - represents a 1600km by 1600km area within A2
4 - represents a 800km by 800km area within A23
3 - represents a 400km by 400km area within A234
1 - represents a 200km by 200km area within A2343
1 - represents a 100km by 100km area within A23431
Graphically, it might look like this:
       View of A                       View of A23
|----------|-----------|        |----------|-----------|
|          | A21 | A22 |        |          |           |
|    A1    |-----|-----|        |   A231   |   A232    |
|          | A23 | A24 |        |          |           |
|----------|-----------|        |----------|-----------|
|          |           |        |          |A2341|A2342|
|    A3    |    A4     |        |   A233   |-----|-----|
|          |           |        |          |A2343|A2344|
|----------|-----------|        |----------|-----------|   ... and so on.
In this case, our point P is in A234311. Suppose also that we want to get the sorted points within 400km. A2343 is 400km by 400km, so we need to load the results from A2343 and all of its 8-connected neighbors (A2341, A2342, A2344, A2334, A2332, A4112, A4121, A4122). Once we've loaded only those into memory, we calculate the distances, sort them, and discard any results that are more than 400km away.
(You could keep the results that are more than 400km away as long as the users/clients know that beyond 400km, the data could be incomplete.)
The hashing method that DynamoDB Geo library uses is very similar to a Z-Order Curve—you may find it helpful to familiarize yourself with that method as well as Part 1 and Part 2 of the AWS Database Blog on Z-Order Indexing for Multifaceted Queries in DynamoDB.
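A minimal sketch of the in-memory step of that workaround: once the items from the centre cell and its eight neighbours have been loaded, compute the distances, sort, and cut off at the chosen radius. The lat/lon attribute names are assumptions:

from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def sort_within(items, lat, lon, radius_km=400):
    """items: dicts with 'lat' and 'lon' attributes (assumed names);
    returns (distance, item) pairs sorted by distance, within the radius."""
    with_dist = [(haversine_km(lat, lon, float(i["lat"]), float(i["lon"])), i) for i in items]
    nearby = [pair for pair in with_dist if pair[0] <= radius_km]
    return sorted(nearby, key=lambda pair: pair[0])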
Not exactly. When querying a location you can query by a fixed partition key value and by sort key, so you can limit your query result set and also apply some filtering.
I have been racking my brain while designing a DynamoDB Geo Hash proximity locator service. For this example customer_A wants to find all service providers_X in their area. All customers and providers have a 'g8' key that stores their precise geoHash location (to 8 levels).
The accepted way to accomplish this search is to generate a secondary index from the main table with a less accurate geohash 'g4', which gives a broader area for the main query key. I am applying key overloading and composite key structures for a single-table design. The goal in this design is to return all the data required in a single query; secondary indexes can duplicate data by design (storage is cheap, but CPU and bandwidth are not).
GSI1PK    GSI1SK     providerId       Projected keys and attributes
--------------------------------------------------------------------
g4_9q5c   provider   pr_providerId1   name rating
g4_9q5c   provider   pr_providerId2   name rating
g4_9q5h   provider   pr_providerId3   name rating
Scenario 1: customer_A.g8_9q5cfmtk. So you issue a query where GSI1PK = g4_9q5c and a list of two providers is returned, not the three I desire.
But using geoHash.neighbor() will return the eight surrounding neighbors, like 9q5h (see reference below). That's great because there is a provider in 9q5h, but it means I have to run nine queries, one on the center and eight on the neighbors, or run 1-N until I have the minimum results I require.
But which direction to query second: NW, SW, E? This would require another level of hinting toward which neighbor has more results, without knowing first, unless you run a pre-query for weighted results. But then you run the risk of only returning favorable neighbors, as there could be new providers in previously unfavored neighbors. You could apply some ML and randomized queries into neighbors to check current counts.
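Since Scenario 1 boils down to nine queries (the centre g4 cell plus its eight neighbours), here is a minimal boto3 sketch of that loop, assuming a hypothetical table "providers" with the GSI shown above; computing the neighbour cells is left to whatever geohash helper you already use:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("providers")  # hypothetical table name

def providers_near(g4_cells):
    """g4_cells: the customer's 4-character geohash cell plus its 8 neighbours."""
    results = []
    for cell in g4_cells:
        resp = table.query(
            IndexName="GSI1",
            KeyConditionExpression=Key("GSI1PK").eq(f"g4_{cell}")
            & Key("GSI1SK").begins_with("provider"),
        )
        results.extend(resp["Items"])  # pagination omitted for brevity
    return results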
Before the above approach I tried this design.
GSI1PK   GSI1SK        providerId     Projected keys and attributes
--------------------------------------------------------------------
loc      g8_9q5cfmtk   pr_provider1
loc      g8_9q5cfjgq   pr_provider2
loc      g8_9q5fe954   pr_provider3
Scenario 2: customer_A.g8_9q5cfmtk. So you issue a query where GSI1PK = loc and GSI1SK is between g8_9q5ca and g8_9q5fz, and a list of three providers is returned, but a ton of data was pulled and discarded.
To achieve the above query, the BETWEEN X AND Y sort criterion is composed as follows: 9q5c.neighbors().sorted() = 9q59, 9q5c, 9q5d, 9q5e, 9q5f, 9q5g, 9qh1, 9qh4, 9qh5. So we can just use X = 9q59 and Y = 9qh5, but there are over 50 (I really didn't count after 50) matching quadrants in such a BETWEEN over string ordering.
Regarding the hash/size table above, I would recommend using this: https://www.movable-type.co.uk/scripts/geohash.html
Geohash length   Cell width   Cell height
1                ≤ 5,000km    × 5,000km
2                ≤ 1,250km    × 625km
3                ≤ 156km      × 156km
4                ≤ 39.1km     × 19.5km
5                ≤ 4.89km     × 4.89km
...

Is there a DynamoDB max partition size of 10GB for a single partition key value?

I've read lots of DynamoDB docs on designing partition keys and sort keys, but I think I must be missing something fundamental.
If you have a bad partition key design, what happens when the data for a SINGLE partition key value exceeds 10GB?
The 'Understand Partition Behaviour' section states:
"A single partition can hold approximately 10 GB of data"
How can it partition a single partition key?
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
The docs also talk about a local secondary index being limited to 10GB of data, after which you start getting errors.
"The maximum size of any item collection is 10 GB. This limit does not apply to tables without local secondary indexes; only tables that have one or more local secondary indexes are affected."
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LSI.html#LSI.ItemCollections
That I can understand. So does it have some other magic for partitioning the data for a single partition key if it exceeds 10GB? Or does it just keep growing that partition? And what are the implications of that for your key design?
The background to the question is that I've seen lots of examples of using something like a TenantId as a partition key in a multi-tenant environment. But that seems limiting if a specific TenantId could have more than 10GB of data.
I must be missing something?
TL;DR - items can be split even if they have the same partition key value, by including the range key value in the partitioning function.
The long version:
This is a very good question, and it is addressed in the documentation here and here. As the documentation states, items in a DynamoDB table are partitioned based on their partition key value (which used to be called hash key) into one or multiple partitions, using a hashing function. The number of partitions is derived based on the maximum desired total throughput, as well as the distribution of items in the key space. In other words, if the partition key is chosen such that it distributes items uniformly across the partition key space, the partitions end up having approximately the same number of items each. This number of items in each partition is approximately equal to the total number of items in the table divided by the number of partitions.
The documentation also states that each partition is limited to about 10GB of space. And that once the sum of the sizes of all items stored in any partition grows beyond 10GB, DynamoDB will start a background process that will automatically and transparently split such partitions in half - resulting in two new partitions. Once again, if the items are distributed uniformly, this is great because each new sub-partition will end up holding roughly half the items in the original partition.
An important aspect to splitting is that the throughput of the split-partitions will each be half of the throughput that would have been available for the original partition.
So far we've covered the happy case.
On the flip side, it is possible to have one, or a few, partition key values that correspond to a very large number of items. This can usually happen if the table schema uses a sort key and several items have the same partition key value. In such a case, a single partition key value could be responsible for items that together take up more than 10 GB, and this will result in a split. In this case DynamoDB will still create two new partitions, but instead of using only the partition key to decide which sub-partition an item should be stored in, it will also use the sort key.
Example
Without loss of generality, and to make things easier to reason about, imagine that there is a table where partition keys are letters (A-Z), and numbers are used as sort keys.
Imagine that the table has about 9 partitions, so letters A, B, C would be stored in partition 1, letters D, E, F would be in partition 2, etc.
In the diagram below, the partition boundaries are marked h(A0), h(D0), etc. to show that, for instance, the items stored in the first partition are the items whose partition key hashes to a value between h(A0) and h(D0) - the 0 is intentional, and comes in handy next.
[ h(A0) ]--------[ h(D0) ]---------[ h(G0) ]-------[ h(J0) ]-------[ h(M0) ]- ..
|  A    B    C   |  E    F         |  G    I       |  J    K    L  |
|  1    1    1   |  1    1         |  1    1       |  1    1    1  |
|  2    2    2   |  2    2         |  2            |  2            |
|  3    3        |  3              |  3            |               |
     ..               ..               ..              ..              ..
|            100 |           500   |               |               |
+----------------+-----------------+---------------+---------------+-- ..
Notice that for most partition key values, there are between 1 and 3 items in the table, but there are two partition key values: D and F that are not looking too good. D has 100 items while F has 500 items.
If items with a partition key value of F keep getting added, eventually the partition [h(D0)-h(G0)) will split. To make it possible to split the items that have the same hash key, the range key values will have to be used, so we'll end up with the following situation:
..[ h(D0) ]------------/ [ h(F500) ] / ----------[ h(G0) ]- ..
  |  E      F           |  F                     |
  |  1      1           |  501                   |
  |  2      2           |  502                   |
  |  3                  |  503                   |
        ..                   ..
  |         500         |  1000                  |
..---+------------------+------------------------+--- ..
The original partition [h(D0)-h(G0)) was split into [h(D0)-h(F500)) and [h(F500)-h(G0))
I hope this helps to visualize that items are generally mapped to partitions based on a hash value obtained by applying a hashing function to their partition key value, but if need be, the value being hashed can include the partition key + a sort key value as well.
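To make that last point concrete, here is a toy Python illustration (not DynamoDB's actual algorithm, which splits hash ranges rather than using a modulus) of placing an item by hashing the partition key alone, versus partition key plus sort key once a key value has been split:

import hashlib

def bucket(value, num_buckets):
    """Map a string to one of num_buckets partitions via a hash."""
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % num_buckets

def place_item(pk, sk, split_keys, num_buckets):
    """Hash only the partition key, unless that key value has been split,
    in which case the sort key participates in placement too."""
    key = f"{pk}#{sk}" if pk in split_keys else pk
    return bucket(key, num_buckets)

# e.g. place_item("F", "501", split_keys={"F"}, num_buckets=4) spreads F's items
# across partitions, while A, B, C, ... still map by partition key alone.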

DynamoDB Schema Design

I'm thinking of using Amazon AWS DynamoDB for a project that I'm working on. Here's the gist of the situation:
I'm going to be gathering a ton of energy usage data for hundreds of machines (energy readings are taken around every 5 minutes). Each machine is in a zone, and each zone is in a network.
I'm then going to roll up these individual readings by zone and network, by hour and day.
My thinking is that by doing this, I'll be able to perform one query against DynamoDB on the network_day table, and return the energy usage for any given day quickly.
Here's my schema at this point:
table_name | hash_key | range_key | attributes
______________________________________________________
machine_reading | machine.id | epoch | energy_use
machine_hour | machine.id | epoch_hour | energy_use
machine_day | machine.id | epoch_day | energy_use
zone_hour | machine.id | epoch_hour | energy_use
zone_day | machine.id | epoch_day | energy_use
network_hour | machine.id | epoch_hour | energy_use
network_day | machine.id | epoch_day | energy_use
I'm not seeing great performance in tests when I run the rollup cron job, so I'm wondering if someone with more experience could comment on my key design? The only experience I have so far is with RDS, but I'm very much trying to learn about DynamoDB.
EDIT:
Basic structure for the cronjob that I'm using for rollups:
foreach network
    foreach zone
        foreach machine
            add_unprocessed_readings_to_dynamo()
            roll_up_fixture_hours_to_dynamo()
            roll_up_fixture_days_to_dynamo()
        end
        roll_up_zone_hours_to_dynamo()
        roll_up_zone_days_to_dynamo()
    end
    roll_up_network_hours_to_dynamo()
    roll_up_network_days_to_dynamo()
end
I use the previous function's values in Dynamo for the next roll-up, i.e. I use zone hours to roll up zone days, and I then use zone days to roll up network days.
This is what (I think) is causing a lot of unnecessary reads/writes. Right now I can manage with low throughputs because my sample size is only 100 readings. My concerns begin when this scales to what is expected to contain around 9,000,000 readings.
First things first, time series data in DynamoDB is hard to do right, but not impossible.
DynamoDB uses the hash key to shard the data, so using machine.id means that some of your keys are going to be hot. However, this is really a function of the amount of data and what you expect your IOPS to be. DynamoDB doesn't create a second shard until you push past 1000 read or write IOPS. If you expect to be well below that level you may be fine, but if you expect to scale beyond that then you may want to redesign - specifically, include a date component in your hash key to break things up.
Regarding performance, are you hitting your provisioned read or write throughput level? If so, raise them to some crazy high level and re-run the test until the bottleneck becomes your code. This could be as simple as setting the throughput level appropriately.
However, regarding your actual code: without seeing the actual DynamoDB queries you are performing, a possible issue is reading too much data. Make sure you are not reading more data than you need from DynamoDB. Since your range key is a date field, use a range condition (not a filter) to reduce the number of records you need to read.
Make sure your code executes the rollup using multiple threads. If you are not able to saturate the DynamoDB provisioned capacity, the issue may not be DynamoDB; it may be your code. By performing the rollups in parallel across multiple threads you should see some performance gains.
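As a concrete example of "range condition, not filter", here is a minimal boto3 sketch against the machine_reading table from the question; I've written the hash key as machine_id rather than machine.id to keep the expression simple, so treat the attribute names as assumptions:

import boto3
from boto3.dynamodb.conditions import Key

readings = boto3.resource("dynamodb").Table("machine_reading")

def readings_for_range(machine_id, start_epoch, end_epoch):
    """Only items whose range key falls inside the epoch window are read (and billed)."""
    resp = readings.query(
        KeyConditionExpression=Key("machine_id").eq(machine_id)
        & Key("epoch").between(start_epoch, end_epoch)
    )
    return resp["Items"]  # pagination omitted for brevity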
What's the provisioned throughput on the tables you are using? How are you performing the rollup? Are you reading everything and filtering, or filtering on range keys, etc.?
Do you need a roll-up / cron job in this situation?
Why not use a table for the readings
machine_reading | machine.id | epoch_timestamp | energy_use
and a table for the aggregates
The hash key can be the aggregate type and the range key can be the aggregate name.
example:
zone, zone1
zone, zone3
day, 03/29/1940
When getting machine data, dump it into the first table, and after that use atomic counters to increment entries in the second table:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
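A minimal boto3 sketch of that atomic-counter update on the aggregates table; the table and attribute names are assumptions for illustration:

import boto3

aggregates = boto3.resource("dynamodb").Table("aggregates")  # hypothetical table name

def add_energy(aggregate_type, aggregate_name, energy_use):
    """ADD is applied atomically on the server, so concurrent writers don't clobber each other."""
    aggregates.update_item(
        Key={"aggregate_type": aggregate_type, "aggregate_name": aggregate_name},
        UpdateExpression="ADD energy_use :e",
        ExpressionAttributeValues={":e": energy_use},  # int or Decimal, not float
    )

For example, add_energy("zone", "zone1", 5) bumps the running total for zone1 by 5.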

Fast way to lookup entries with duplicates

I am looking for a way to help me quickly look up duplicate entries in some sort of table structure in a very efficient way. Here's my problem:
I have objects of type Cargo where each Cargo has a list of other associated Cargo. Each cargo also has a type associated with it.
so for example:
class Cargo {
    int cargoType;
    std::list<Cargo*> associated;
};
Now, for each cargo object and each of its associated cargo, a certain value is assigned based on their types. This evaluation is done by classes that implement CargoEvaluator.
I also have a CargoEvaluatorManager which basically handles connecting everything together. CargoEvaluators are registered with the CargoEvaluatorManager; then, to evaluate cargo, I call CargoEvaluatorManager.EvaluateCargo(Cargo* cargo).
Here's the current state:
class CargoEvaluatorManager {
    std::vector<std::vector<CargoEvaluator*>> evaluators;

    double EvaluateCargo(Cargo* cargo)
    {
        double value = 0.0;
        for (auto& associated : cargo->associated) {
            auto evaluator = evaluators[cargo->cargoType][associated->cargoType];
            if (evaluator != nullptr)
                value += evaluator->Evaluate(cargo, associated);
        }
        return value;
    }
};
So to recap and mention a few extra points:
CargoEvaluatorManager stores CargoEvaluators in a 2-D array using cargo types as indices. The entire 2-D vector is initialized with nullptrs. When a CargoEvaluator is registered, resizing the array and the other bits and pieces I haven't shown here are handled appropriately.
I had tried using a map with std::pair as a key to look up different evaluators, but it is too slow.
This only allows one CargoEvaluator per combination of cargo types. I potentially want to allow multiple cargo evaluators as well.
I am calling EvaluateCargo tens of billions of times. I am aware my current implementation is not the most efficient and am looking for alternatives.
What I am looking for
As stated above, I want to do much of what I've outlined, with the exception that I want to allow multiple Evaluators for each pair of Cargo types. What I envision, naively, is a table like this:
---------------------------------------------
| int type 1 | int type 2 | CargoEvaluator* |
|-------------------------------------------|
| 5          | 7          | Eval1*          |
| 5          | 7          | Eval2*          |
| 4          | 6          | Eval3*          |
---------------------------------------------
The lookup should be symmetric in that the set (5,7) and (7,5) should resolve to the same entries. For speed I don't mind preloading duplicate entries.
There are maybe 3x or more associated Cargo in the list than there are Evaluators, if that factors into things.
Performance is crucial, as mentioned above!
For bonus points, each cargo evaluator may have an additional penalty value associated with it, that is not dependent on how many Associates a Cargo has. In other words: if a row in the table above is looked up, I want to call double Evaluator->AddPenality() once and only once each time EvaluateCargo is called. I cannot store any instance variables since it would cause some multithreading issues.
One added constraint is I need to be able to identify the CargoEvaluators associated with a particular cargotype, meaning that hashing the two cargotypes together is not a viable option.
If any further clarification is needed, I'll gladly try to help.
Thank you all in advance!