How to implement simple dynamodb table with daily value


I'm learning AWS API Gateway + Lambda + DynamoDB by building a very simple API project.
I have a daily value starting from 2013-01-01 that is updated every day, so the data looks something like this:
[
{
"value": 1776.09,
"date": "2013-01-01"
},
{
"value": 1779.25,
"date": "2013-01-02"
},
// ...
{
"value": 2697.32,
"date": "2018-11-22"
}
]
In the API I want to get the data for a specific day and for a range (dateFrom - dateTo). I've been reading about DynamoDB and plan to use the date, in YYYY-MM-DD format, as the partition key with no sort key. I'm not sure this is the correct approach for this type of data and for the range query, since I assume the range query will need a full table scan, although it is a small data set.
Can someone tell me whether this approach is right, or whether I need to reconsider my table structure?

What you propose will work.
However, if you want to improve the efficiency of the design, you could use a partition key of YYYY and then your sort key could be MM-DD. That way, you can use a query operation to limit the results (or you could still use a scan).
You could even use a single, constant value for the partition key and date as the sort key, but having the same partition key for every item is generally not recommended.
Either way, your data is small enough that you should probably just pick the implementation that is simplest to develop and maintain.
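As a rough sketch of the YYYY / MM-DD layout described above, the function below builds the request parameters for a date-range lookup, one Query per calendar year in the range. The table name DailyValues and the attribute names year and month_day are assumptions made for illustration. The dictionaries follow the shape expected by the boto3 low-level client's query call, but nothing here talks to AWS.

```python
from datetime import date


def build_range_queries(table_name, date_from, date_to):
    """Build one Query request per calendar year covered by the range."""
    requests = []
    for year in range(date_from.year, date_to.year + 1):
        # Clamp the MM-DD bounds to the year currently being queried.
        low = date_from.strftime("%m-%d") if year == date_from.year else "01-01"
        high = date_to.strftime("%m-%d") if year == date_to.year else "12-31"
        requests.append({
            "TableName": table_name,
            # Name placeholders avoid clashes with DynamoDB reserved words.
            "KeyConditionExpression": "#pk = :y AND #sk BETWEEN :lo AND :hi",
            # "year" and "month_day" are hypothetical attribute names.
            "ExpressionAttributeNames": {"#pk": "year", "#sk": "month_day"},
            "ExpressionAttributeValues": {
                ":y": {"S": str(year)},
                ":lo": {"S": low},
                ":hi": {"S": high},
            },
        })
    return requests


# A range that crosses a year boundary needs two queries.
requests = build_range_queries("DailyValues", date(2017, 12, 30), date(2018, 1, 2))
```

Each dictionary could then be passed to a boto3 DynamoDB client as client.query(**params), with the results concatenated in year order.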

Copying my answer from this post
A few concepts of NoSQL databases:
Writes should be spread evenly across primary keys.
Reads should be spread evenly across primary keys.
The obvious schema that comes to mind for this problem in DynamoDB is to
have the key logs as the primary key and the timestamp as the secondary key, and to do an aggregation with
select * where pk=logs and sk is_between x and y
But this violates both concepts: we are always writing to a single pk and always reading from the same one.
Now, for this particular problem:
Our PK should be random enough (so that there are no hot keys) and deterministic enough (so that we can query).
We will have to make some assumptions about the application while designing keys. Let's say we decide to update every hour, so we can have 7-jan-2018-17 as a key, where 17 means the 17th hour. This key is deterministic but not random enough, and every update or read on 7 Jan will mostly go to the same partition. To make the key random we can compute a hash of it using a hashing algorithm like MD5. Let's say that after hashing, our key becomes 1sdc23sjdnsd. This will not make any sense if you are looking at the table data, but if you want to know the event count on 7-jan-2018-17 you just hash the time and do a get from DynamoDB with the hashed key.
If you want to know all the events on 7-jan-2018, you can do 24 gets and aggregate the count.
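A minimal sketch of the hashed hourly key, assuming an ISO-style bucket string such as 2018-01-07-17 rather than the 7-jan-2018-17 spelling used above (the exact format is an assumption; it only has to be deterministic). Note that DynamoDB also applies its own internal hash to partition key values, so treat this purely as an illustration of the scheme described here.

```python
import hashlib
from datetime import date, datetime, time


def hashed_hour_key(moment):
    """Return the MD5 hex digest of an hour bucket such as '2018-01-07-17'."""
    bucket = moment.strftime("%Y-%m-%d-%H")  # assumed bucket format
    return hashlib.md5(bucket.encode("utf-8")).hexdigest()


def hashed_keys_for_day(day):
    """Return the 24 hashed partition keys needed to read one full day."""
    return [hashed_hour_key(datetime.combine(day, time(hour))) for hour in range(24)]


# Reading all of 7 Jan 2018 means 24 separate gets, one per key.
keys = hashed_keys_for_day(date(2018, 1, 7))
```

The same function has to be used on the write path and the read path, otherwise the hashed keys will not match.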
This kind of schema will have issues if:
You decide to change from an hourly to a per-minute basis.
Most of your queries are run-time queries such as "get me all the data for the last 2, 4, or 6 days". That means too many round trips to the database, which is inefficient in both time and cost.
The rule of thumb is: when query patterns are well defined, use NoSQL and store the results for performance reasons. If you are trying to do join or aggregation queries on NoSQL, you are force-fitting your use case to your technology choice.
You can also look at the AWS recommendations for storing time series data.

Related

How to limit dynamodb scan to a given partition key and NOT read the entire table

Theoretical table with billions of entries.
Partition key is a unique uuid representing a given deviceId. There will be around 10k unique uuids.
Sort Key is a dateString for when the data was collected.
Each item has some data fields. There are dozens of fields such that making a GSI for each wouldn't be reasonable. For our example, let's say we are looking for the "dataOfInterest" field.
I'd like to search the DB for "all items where the dataOfInterest = 'foobar'" - and ideally do it within a date range. As far as I know, a scan operation is the only option. With billions of entries... that's not going to be a fast process (though I understand I could split it out to run multiple operations at a time - it's still going to eat RCUs like crazy).
Of note, I only care about a given uuid for each search, however. In other words, what I REALLY care about is "all items within a given partition where the dataOfInterest = 'foobar'". And further, it'd be great to use the sort key to give "all items within a given partition where the dataOfInterest = 'foobar' that are between Jan 1 and Feb 28".
The scan operation allows you to limit the results with a filter expression such that I could get the results of just a single partition ... but it still reads the entire table and the filtering is done before returning the data to you. https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html
Is there an AWS API that does a scan-like operation that reads only a given partition? Are there other ways to achieve this (perhaps re-architecting the DB?)

As @jarmod says, you can use a Query and specify the PK of the UUID. You can then either put the timestamp into the SK and filter for the dataOfInterest value (unindexed), or, for more efficiency and to make everything indexed, you can construct a composite SK of the form dataOfInterest#timestamp and then do a range query on the SK from foobar#time1 to foobar#time2. That makes this query perfectly index optimized.
Of course, this makes purely timestamp-based queries less simple. So you either do multiple queries for those or, if you want both queries to be efficient, set up this composite SK in a GSI and use that to resolve this query.
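To make the two variants concrete, here is a hedged sketch that only builds the Query parameters (in the shape the boto3 low-level client accepts). The attribute names deviceId, dateString and sk, and the table name Readings, are placeholders chosen for illustration.

```python
def build_filtered_query(table_name, device_id, value, time_from, time_to):
    """Variant 1: timestamp in the SK, dataOfInterest filtered after the read."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#pk = :pk AND #ts BETWEEN :lo AND :hi",
        # The filter is unindexed: matching happens after items are read.
        "FilterExpression": "#field = :value",
        "ExpressionAttributeNames": {
            "#pk": "deviceId", "#ts": "dateString", "#field": "dataOfInterest",
        },
        "ExpressionAttributeValues": {
            ":pk": {"S": device_id},
            ":lo": {"S": time_from},
            ":hi": {"S": time_to},
            ":value": {"S": value},
        },
    }


def build_composite_sk_query(table_name, device_id, value, time_from, time_to):
    """Variant 2: composite SK 'dataOfInterest#timestamp', fully key-driven."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#pk = :pk AND #sk BETWEEN :lo AND :hi",
        "ExpressionAttributeNames": {"#pk": "deviceId", "#sk": "sk"},
        "ExpressionAttributeValues": {
            ":pk": {"S": device_id},
            ":lo": {"S": f"{value}#{time_from}"},
            ":hi": {"S": f"{value}#{time_to}"},
        },
    }


composite = build_composite_sk_query(
    "Readings", "uuid-1", "foobar", "2023-01-01", "2023-02-28T23:59:59"
)
```

Because sort keys compare as strings, the upper bound should cover the whole last day (as in the example call), otherwise items stamped later on Feb 28 fall outside the range.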

DynamoDB Schema - Retrieve All Users

I am trying to model my data. As you see the partition key is the user email. In the global secondary index I have a PK of "US", which stands for "User". If I want to get all of the enabled users I just have to query the GSI where GSI1PK = "US" and GSI1SK Starts with "Enabled".
My concern is that all of the users in the app would have the same GSI1PK. Will this be a problem? Can GSIs PK have problems with "hot partitions"? I am Googling this and I do not see a clear answer. There is only one here on StackOverflow that says it will be a problem, but there are other places that say it will not. I am kind of confused.
What would be the best way to structure the data in my table so I can access all of the users without causing hot partition issues?

Placing a potentially large item collection in a single partition will likely lead to a hot partition. Ideally, your chosen partition keys evenly distribute data across partitions. However, it is not always clear how to achieve this.
You might consider splitting your large partition into smaller partitions on write (aka write sharding), and re-combining them when reading. For example, when creating GSI1PK, you could introduce a randomly generated integer between 1 and 4 in the partition key, so that users are spread over four GSI partition key values instead of one.
Now your User data is more evenly distributed across partitions. When reading users from your table, you would pull from all the partitions at once. This could be done in parallel for faster performance.
In this example, I chose a random number to "write shard" the data into separate partitions. However, your data may lend itself to a more natural division (e.g. by country, enabled status, time zone, etc). What I want to highlight is that your strategy to distribute data across partitions can be separate from the data model you use to support your application access patterns.
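A small sketch of the write-sharding idea, assuming the shard number is appended as US#1 through US#4 (the separator and the helper names are illustrative, not something the original answer specifies). query_one stands in for whatever function runs a single GSI query and returns a list of items.

```python
import random
from concurrent.futures import ThreadPoolExecutor

SHARD_COUNT = 4  # the answer's example uses integers 1 to 4


def sharded_gsi_pk(base="US", shard_count=SHARD_COUNT, rng=random):
    """Pick a random shard for a new item, e.g. 'US#3'."""
    return f"{base}#{rng.randint(1, shard_count)}"


def all_gsi_pks(base="US", shard_count=SHARD_COUNT):
    """List every shard key that a full read has to visit."""
    return [f"{base}#{n}" for n in range(1, shard_count + 1)]


def read_all_shards(query_one, base="US", shard_count=SHARD_COUNT):
    """Run query_one(pk) for each shard in parallel and concatenate the results."""
    keys = all_gsi_pks(base, shard_count)
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        pages = list(pool.map(query_one, keys))  # map preserves key order
    return [item for page in pages for item in page]


# Stand-in for a real GSI query: returns one fake item per shard.
users = read_all_shards(lambda pk: [f"user-in-{pk}"])
```

If the combined result needs a global order (for example by GSI1SK), it has to be sorted or merged after the per-shard reads.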

When it's worth the tradeoff of using local secondary index in DynamoDB?

I've read guidelines for secondary indexes but I'm not sure when the ability to search fast outweighs the disadvantage of scan over attributes. Let me give you an example.
I am saving game progress data for users. The PK is user ID. I need to be able to:
Find out user progress about a particular game.
Get all finished/in progress games for a user.
Thus, I can design my SK as progress_{state} to be able to query all games by progress fast (state represents started/finished), or I can design my SK as progress_{gameId} to be able to query the progress of a given game fast. However, I can't have both using just the SK. Whichever one I choose, the other operation will require a scan.
Therefore, I was thinking about using LSI which will add an overhead to the whole table as noted by Amazon here:
Every secondary index means more work for DynamoDB. When you add, delete, or replace items in a table that has local secondary indexes, DynamoDB will use additional write capacity units to update the relevant indexes.
I estimate at most thousands of game types, and I wonder whether it's worth using an LSI or whether it's better to use scans for the other operation.
Does anyone have any real experience with such a problem? I was not able to find anything on this topic.

When you are designing DynamoDB tables, the main cost factor is the IOPS for reads and writes.
This is why it is usually better to avoid scans. A scan consumes a significant amount of read IOPS, and that amount grows with the number of items in the table, since a scan needs to read all the items in the table before returning the matching ones.
Coming back to your use case of using the SK for progress, it would be better to store progress as an attribute and define a secondary index on it, since you will need to update the state later on (which is not possible for the PK and SK of the table).
So based on your use case and the information given in the question, you can define the schema as:
PK: UserID
SK: GameID
GSI: Progress (PK)
Query all games by progress fast:
use the GSI on Progress (PK).
Note: if this is for a particular user, you can change it to an LSI on Progress.
Query the progress of a given game fast (assuming it is for a given user):
query using UserID (PK) and GameID (SK) of the table.
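As an illustration of the two access patterns under this schema, the sketch below builds the request parameters for each, using the per-user LSI variant mentioned in the note. The index name ProgressIndex and the table name GameProgress are made up for the example, and the dictionaries follow the boto3 low-level client format without calling AWS.

```python
def game_progress_request(table_name, user_id, game_id):
    """GetItem parameters: progress of one game for one user (table PK + SK)."""
    return {
        "TableName": table_name,
        "Key": {"UserID": {"S": user_id}, "GameID": {"S": game_id}},
    }


def games_by_progress_request(table_name, user_id, progress, index_name="ProgressIndex"):
    """Query parameters: all games of one user in a given state, via an LSI.

    The LSI keeps UserID as partition key and uses Progress as its sort key.
    "ProgressIndex" is a hypothetical index name.
    """
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "#u = :u AND #p = :p",
        "ExpressionAttributeNames": {"#u": "UserID", "#p": "Progress"},
        "ExpressionAttributeValues": {
            ":u": {"S": user_id},
            ":p": {"S": progress},
        },
    }


one_game = game_progress_request("GameProgress", "user-42", "chess")
finished = games_by_progress_request("GameProgress", "user-42", "finished")
```

Since Progress is an ordinary attribute here rather than part of the table key, it can be changed with a normal update when a game moves from started to finished.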

How do I optimize my DynamoDB table secondary global index so that records are evenly distributed while still keeping all records sortable?

Related to this question, I'm looking for a more specific answer. In an effort to keep this non-subjective, here is a full thought process for creating an activities table, with a stuck point that can be resolved with a quick example answer.
In an effort to better understand DynamoDB, I'm creating a personal website that contains an activity feed from a DynamoDB table. The goal is to evenly distribute partition keys while still being able to sort across all partition keys (I'm struggling with this part).
Different types of activities will include blog posts, projects, twitter post references, LinkedIn post references, etc. Using the activity type as a partition key would not be wise as my activity is highly weighted, mostly on the twitter side, hardly ever creating blog posts.
A unique activity id seems to be the best option for evenly distributing activities across DynamoDB partitions. However, this completely removes the ability to sort activities to start, as queries require a partition id to be known first. This is where a secondary global index (SGI) will be helpful. With this, a sort key will not be required on the primary partition key, but paired in an SGI.
This is the part where I'm stuck. What do I base the SGI partition key on? At the moment I'm thinking of a single value "activity" for all activities with a sort key of "date", but that is a single partition for all entries. Will a single SGI partition key value limit performance in this project?
Note that this is a small scale project. However, I'm thinking about large scale projects while building this one, attempting to create the best DynamoDB table possible in regards to optimized partition distribution, while still keeping it flexible for sorting all table records.

Treat a GSI (Global Secondary Index) the same as the main table's index while designing your schema: GSIs also get read/write provisioning limits and are subject to hot partition throttling, which puts back pressure on the main table. In other words, if your GSI gets throttled, your main table will start throttling write requests.
"Will a single SGI partition key value limit performance in this project?"
A single partition for the complete table is definitely a misuse of DDB's scalability.
"The goal is to evenly distribute partition keys while still being able to sort across all partition keys (I'm struggling with this part)."
You can sort across partitions using a GSI, but you will again need a partition key for your GSI, and if that partition key is not distributed enough you get into the problems I mentioned above.
DDB is powerful for put/get operations if modeled right, and for fairly simple queries with some filters. In general, you will utilize your throughput more efficiently as the ratio of partition key values accessed to the total number of partition key values in a table grows.
For your specific need it's not directly possible to get a scalable solution from DDB, but we still have a few options:
Option 1:
We can model the data so that writes are fairly distributed, at the cost of extra work when reading it back. This pattern is also known as Randomizing Across Multiple Partition Key Values. Since you don't need to access a specific item for a given time, this will work for us.
The idea is to create a fixed set (say 1 to 100), randomly pick a number from it to append to the creation date (not the timestamp), and use the creation timestamp as the sort key.
This distributes your load across multiple random partitions but increases read complexity, as you will need to query all partitions and merge the results to get the final sorted view for that date.
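A rough sketch of Option 1, with the read-side merge done locally. The key format date.suffix (for example 2018-11-22.37) and the created_at field name are assumptions. Each per-partition result is expected to arrive already sorted by the sort key, which is what a DynamoDB Query returns.

```python
import heapq
import random
from operator import itemgetter

SUFFIXES = range(1, 101)  # the fixed set 1..100 from the description


def write_partition_key(creation_date, rng=random):
    """Partition key for a new item: creation date plus a random suffix."""
    return f"{creation_date}.{rng.choice(SUFFIXES)}"


def read_partition_keys(creation_date):
    """Every partition key that must be queried to read one date."""
    return [f"{creation_date}.{n}" for n in SUFFIXES]


def merge_sorted_partitions(partitions):
    """Merge per-partition results, each already sorted by 'created_at'."""
    return list(heapq.merge(*partitions, key=itemgetter("created_at")))


# Three fake partition results, each sorted by creation timestamp.
merged = merge_sorted_partitions([
    [{"created_at": 1}, {"created_at": 4}],
    [{"created_at": 2}],
    [],
])
```

heapq.merge keeps the merge lazy and linear in the total number of items, which matters more as the number of suffixes grows.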
Option 2:
Use multiple tables for hot and cold data as it is time series based data. For info read
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
Option 3:
Scan? Not a good choice for scalability as your data grows, but for a fairly small data set it certainly helps, so it is worth mentioning.
These are just examples; I'm not saying they are a good fit for your use case.
So here is a thought exercise for you: write down all your use cases and access patterns. Work out their importance and which are fine with eventual consistency and which are not, and see whether DDB is a good fit for them in the first place. Don't be tempted to use DDB and then struggle with access pattern scalability.
Also read https://stackoverflow.com/a/38790120/962545 for more questions you should ask yourself before committing to a specific access pattern in DDB.
Don't forget to read best practices: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

nosql/dynamodb hash and range use case

It's my first time using a NoSQL database so I'm really confused. I'd really appreciate any help I can get.
I want to store data comprising announcements in my table. Essentially, each announcement has an ID, a date, and a text.
So for example, an announcement might have ID of 1, date of 2014/02/26, and text of "This is a sample announcement". Newer announcements always have a greater ID value than older announcements, since they are added to the table later.
There are two types of queries I want to run on this table:
I want to retrieve the text of the announcements sorted in order of date.
I want to retrieve the text and dates of the x most recent announcements (say, the 3 most recent announcements).
So I've set up the table with the following attributes:
ID (number) as primary key, and
date (string) as range
Is this appropriate for my use cases? And if so, what kind of query/reads/requests/scans/whatever (I'm really confused about the terminology here too) should I be running to accomplish the two types of queries I want to make?
Any help will be very much appreciated. Thanks!

You are on the right track.
As far as sorting, DynamoDB will sort by the range key, so date will work but I'd recommend storing it as a number, perhaps milliseconds since the Unix epoch, rather than a String. This will make it trivial to get the announcements in ascending or descending order based on their created date.
See this answer for an overview of local vs global secondary indexes and what capabilities they provide: Optional secondary indexes in DynamoDB
As far as retrieving all items, you would need to perform a scan. Scans are not as efficient as queries, but since all of Dynamo is on SSDs they're still relatively quick. You don't get the single-digit millisecond performance with a scan that you get with a query, so if there's a way to associate announcements with a user ID, you could query by that ID and might get better performance than with a scan.
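For illustration, here is one way the numeric date and the "x most recent" selection could look in plain Python, assuming the dates are interpreted as midnight UTC and that the items have already been fetched (for example by a scan). The field names date and text are placeholders.

```python
import heapq
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(date_string):
    """Convert a 'YYYY/MM/DD' date (taken as midnight UTC) to epoch milliseconds."""
    moment = datetime.strptime(date_string, "%Y/%m/%d").replace(tzinfo=timezone.utc)
    return (moment - EPOCH) // timedelta(milliseconds=1)


def most_recent(items, count):
    """Return the `count` newest items, newest first, from already-fetched items."""
    return heapq.nlargest(count, items, key=lambda item: item["date"])


announcements = [
    {"date": to_epoch_millis("2014/02/26"), "text": "This is a sample announcement"},
    {"date": to_epoch_millis("2014/03/01"), "text": "Newer"},
    {"date": to_epoch_millis("2014/01/15"), "text": "Older"},
]
latest_two = most_recent(announcements, 2)
```

Sorting client-side like this is only reasonable while the table stays small; with a shared partition key and the date as range key, the ordering could be left to a query instead.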
Note that you cannot modify the table's key schema (hash key and range key) or its local secondary indexes after you create the table; global secondary indexes can be added or removed later. There are ways to manually migrate a table or import/export it, but the point is that you should think hard about current and future query requirements up front and design the table to support them. It's very easy to add or stop storing non-key attributes though, which provides nice flexibility.
Finally, try to avoid thinking of Dynamo as relational. With Dynamo, in a lot of cases you may well be better off denormalizing or duplicating some of the data in exchange for fast query performance.
