DynamoDB record size increasing with time


I have a customer table in DynamoDB with basic attributes like name, dob, zipcode, email, etc. I want to add another attribute to it which will keep increasing with time. For example, each time the user clicks on a product (item), I want to add that to the record so that I have the full snapshot of the customer's profile in a single value indexed by the customerId. So, my new attribute would be called viewedItems and would be a list of itemIds viewed (along with the timestamp).
However, given DynamoDB's item size limit (400 KB), the record will eventually exceed it as I keep adding the clicked products to the customer profile.
How can I best define my objects so as to perform the following?
Access the full profile of the customer by customerId, including the views.
Access time filtered profile of the customer (like all interactions since last N days), in which case the viewed items should be filtered by the given time range.
Scan the entire table with a time filter on viewedItems.
The query needs to be performant as the profile could be pulled at request time.
Ability to update individual customer record (via a batch job, for example, that updates each customer's record if need be).
One way to do this would be to create a different table (say customer_viewed_items) with hash key customerId and a range key timestamp with value being the itemId that the customer viewed. But this looks like an increasingly complicated schema - not to mention twice the cost involved in accessing the item. If I have to create another attribute based on (say) "bought" items, then I'll need to create another table. So, the solution I have in mind does not seem good to me.
Would really appreciate if you could help suggest a better schema/approach.

Since you don't know how many items a user will view (edge case: the user opens every item sequentially, multiple times), you cannot store this information in a single DynamoDB record.
The only solution is to normalize your database and create a separate table, as you've described.
Next question: how do you minimize retrieval cost in such a scheme? Usually you don't need to fetch all viewed items; you probably want to display only some of them, so you only need to fetch the last X.
You can cache those items in the main customer table: create a field "lastXviewedItems" and keep it updated so that it contains only a limited number of items and stays within the size limit. Of course, for BI analysis you will still have to store them in the second table too.
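A minimal sketch of both pieces, under stated assumptions: the table name customer_viewed_items comes from the question, while the range key name viewedAt (an ISO-8601 string), the cap of 20 and both helper names are hypothetical. The second function only builds the parameters of a DynamoDB Query; it does not call AWS.

```python
from datetime import datetime, timedelta, timezone

MAX_CACHED = 20  # assumed cap for the cached list on the customer item


def push_recent(cached, item_id, viewed_at, limit=MAX_CACHED):
    """Return a new newest-first list with the view prepended and capped."""
    newest = {"itemId": item_id, "viewedAt": viewed_at}
    return ([newest] + list(cached))[:limit]


def views_since_params(customer_id, days, now=None):
    """Build Query parameters for one customer's views in the last `days` days."""
    now = now or datetime.now(timezone.utc)
    since = (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "TableName": "customer_viewed_items",
        # customerId is the hash key; viewedAt is the (assumed) range key
        "KeyConditionExpression": "customerId = :c AND viewedAt >= :since",
        "ExpressionAttributeValues": {
            ":c": {"S": customer_id},
            ":since": {"S": since},
        },
        "ScanIndexForward": False,  # newest first
    }
```

With boto3, the returned dict could be passed as keyword arguments to the low-level client's query call; the capped list would be written back to the customer item on each view.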

Related

Is this a reasonable way to design this DynamoDB table? Alternatives?

Our team has started to use AWS and one of our projects will require storing approval statuses of various recommendations in a table.
There are various things that identify a single recommendation, let's say they're : State, ApplicationDate, LocationID, and Phase. And then a bunch of attributes corresponding to the recommendation (title, volume, etc. etc.)
The use case will often require grabbing all entries for a given State and ApplicationDate (and then we will look at all the LocationId and Phase items that correspond to it) for review from a UI. Items are added to the table one at a time for a given State, ApplicationDate, LocationId, Phase and updated frequently.
A dev with a little more AWS experience mentioned we should probably use State+ApplicationDate as the partition key, and LocationId+Phase as the sort key. These two pieces combined would make the primary key. I generally understand this, but how does that work if we start getting multiple recommendations for the same primary key? I figure we either are ok with just overwriting what was previously there, OR we have to add some other attribute so we can write a recommendation for the State+ApplicationDate/LocationId+Phase multiple times and get all previous values if we need to... but that would require adding something to the primary key right? Would that be like adding some kind of unique value to the sort key? Or for example, if we need to do status and want to record different values at different statuses, would we just need to add status to the sort key?
Does this sound like a reasonable approach or should I be exploring a different AWS offering for storing this data?

Use a time-based id property, such as a ULID or KSUID. This provides randomness to avoid overwriting data, and also gives you time-based sorting when it is used as part of a sort key.
Because the id value is random, you will want to add it to the sort key of the table or index where you perform your list operations, and reserve the pk for known values that can be specified exactly.
It sounds like 'State' is a value that can change. You can't update an item's key attributes on the table, so it is more common to use such attributes in the key of a GSI if they are needed to list data.
Given the above, an alternative design is to use the LocationId as the pk, the random id value as the sk, and a GSI with 'State' as the pk and the random id as the sk. Or, if you want to list the items by State -> Phase -> date, the GSI sk could be a concatenation of the Phase and id property. This pattern also gives you another list mechanism on the table itself, using the LocationId plus the creation timestamp of the recommendation.
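To illustrate why a time-prefixed id sorts by creation time, here is a simplified stand-in written for this example. It is not a real ULID or KSUID implementation (in practice you would likely use a library for those); the function names and the "#" separator are assumptions.

```python
import os
import time


def time_sortable_id(now_ms=None):
    """Simplified stand-in for a ULID/KSUID: zero-padded ms timestamp + random hex."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Zero padding keeps lexicographic order equal to chronological order.
    return "%013d-%s" % (now_ms, os.urandom(8).hex())


def gsi_sort_key(phase, rec_id):
    """Concatenate Phase and id so items list by Phase first, then creation time."""
    return "%s#%s" % (phase, rec_id)
```

Two ids created in the same millisecond still differ in the random suffix, so a second recommendation for the same LocationId does not overwrite the first.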

DynamoDB Many-to-Many relations

I have a problem modeling my data in DynamoDB. My APP creates notes with the possibility to share a note with other user and allow the other user to update the Note (as done by https://keep.google.com/).
As I need to share notes between users, I decide that my primary table key will be the identifier of a Note.
Then I come with the following data-model for my DynamoDB tables:
Primary Table :(PK = NoteId, SK = Type)
Secondary Table: (GSK = userId, SK = noteId )
The "Type" indicates whether the item is the BODY of the note (where information regarding the note is saved) or an identifier indicating that the note has been shared with another user.
But I do have a problem: I use the global secondary index to retrieve all the notes for a user.
Once I have the list of noteIds, I query my primary table to get all shared notes for the user (the user's own notes are already present in the GSI).
However, to do this I need to use "BatchGetItem".
The problem is that it can only return up to 100 items and 16 MB of data per call.
With more than 100 shared notes I have to call this function several times. Moreover, if the data exceeds 16 MB I need to implement a mechanism to read the rest of the requested data.
This operation could get really slow depending on the data size and the number of shareIds.
As you can imagine this is easily solved using a RDB and "join".
But the idea here is to use DynamoDB.
Data Access patterns:
Get all Notes by userId (own and shared)
Add a shared by userId and sharedId.
Get rights by noteId and userId.
Update a note by Id
Delete a note by Id
Any ideas of how I can change my data-model to improve the access pattern to read all notes?

Modelling your schema around item collections lets you use the Query API, which has no limit on the number of items returned. Each response is capped at 1 MB, though, so larger result sets still need to be paged through.
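The paging mentioned above can be sketched as a small loop. This assumes query_fn behaves like the boto3 client's query method (it returns a dict with "Items" and, when more data remains, "LastEvaluatedKey"); the helper name query_all is made up for this example.

```python
def query_all(query_fn, **params):
    """Collect every item of a Query, following LastEvaluatedKey across pages."""
    items = []
    while True:
        page = query_fn(**params)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Resume the next page where the previous one stopped.
        params["ExclusiveStartKey"] = last_key
```

Because the function takes the query callable as an argument, it can be exercised with a fake before being pointed at a real table.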

When it's worth the tradeoff of using local secondary index in DynamoDB?

I've read guidelines for secondary indexes but I'm not sure when the ability to search fast outweighs the disadvantage of scan over attributes. Let me give you an example.
I am saving game progress data for users. The PK is user ID. I need to be able to:
Find out user progress about a particular game.
Get all finished/in progress games for a user.
Thus, I can design my SK as progress_{state} to be able to query all games by progress fast (state represents started/finished) or I can design my SK as progress_{gameId} to be able to query progress of a given game fast. However, I can't have both using just SK. When I chose one, the other operation will require a scan.
Therefore, I was thinking about using LSI which will add an overhead to the whole table as noted by Amazon here:
Every secondary index means more work for DynamoDB. When you add, delete, or replace items in a table that has local secondary indexes, DynamoDB will use additional write capacity units to update the relevant indexes.
I estimate at most a few thousand different games, and I wonder whether it's worth using an LSI or whether it's better to use scans for the other operation.
Does anyone have any real experience with such a problem? I was not able to find anything on this topic.

When you are designing DynamoDB tables, the main cost factor comes with IOPS for reads and writes.
This is why avoiding scans is usually better. A scan consumes a significant amount of read IOPS, and the cost grows with the number of items in the table, since a scan has to read every item in the table before returning the matching ones.
Coming back to your use case of using the SK for progress: it would be better to store the state as a regular attribute and define a secondary index on it, since you will need to update the state later on (which is not possible for the PK and SK attributes of a table).
So, based on your use case and the information given in the question, you can define the schema as:
Table: PK = UserID, SK = GameID
GSI: Progress (PK)
Query all games by progress fast: query the GSI on Progress (PK). Note: if this is always for a particular user, you can change it to an LSI on Progress.
Query progress of a given game fast (assuming a given user): query the table using UserID (PK) and GameID (SK).
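As a rough sketch of the per-user variant (the LSI on Progress), the dictionaries below follow the shape of the DynamoDB CreateTable and Query requests. The table name GameProgress, the index name ProgressIndex, the string attribute types and on-demand billing are all assumptions made for illustration.

```python
def game_progress_table(table_name="GameProgress"):
    """Build CreateTable parameters: UserID/GameID keys plus an LSI on Progress."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "UserID", "AttributeType": "S"},
            {"AttributeName": "GameID", "AttributeType": "S"},
            {"AttributeName": "Progress", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "UserID", "KeyType": "HASH"},
            {"AttributeName": "GameID", "KeyType": "RANGE"},
        ],
        "LocalSecondaryIndexes": [
            {
                "IndexName": "ProgressIndex",
                # An LSI shares the table's hash key and swaps the range key.
                "KeySchema": [
                    {"AttributeName": "UserID", "KeyType": "HASH"},
                    {"AttributeName": "Progress", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        ],
        "BillingMode": "PAY_PER_REQUEST",
    }


def games_by_progress_params(user_id, state, table_name="GameProgress"):
    """Build Query parameters for one user's games in a given state."""
    return {
        "TableName": table_name,
        "IndexName": "ProgressIndex",
        "KeyConditionExpression": "#u = :u AND #p = :p",
        "ExpressionAttributeNames": {"#u": "UserID", "#p": "Progress"},
        "ExpressionAttributeValues": {":u": {"S": user_id}, ":p": {"S": state}},
    }
```

Note that local secondary indexes have to be declared when the table is created, which is one reason to decide between LSI and GSI up front.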

MS SQL to DynamoDB migration, what's the best partition key to chose in my case

I am working on a migration from MS SQL to DynamoDB and I'm not sure what the best hash key is for my purpose. In MS SQL I have an item table where I store product information for different customers, so the primary key is actually two columns, customer_id and item_no. In application code I need to query specific items and all items for a customer id, so my first idea was to set up the customer id as the hash key and the item no as the range key. But is this the best concept in terms of partitioning? I need to import product data daily, with 50,000-100,000 products for some larger customers, and as far as I know it would be better to have a random hash key. Otherwise the import job will run on one partition only.
Can somebody give me a hint what's the best data model in this case?
Bye,
Peter

It sounds like you need item_no as the partition key, with customer_id as the sort key. Also, in order to query all items for a customer_id efficiently you will want to create a Global Secondary Index on customer_id.
This configuration should give you a good distribution while allowing you to run the queries you have specified.

You are on the right track, but you should be really careful about how you handle write operations, since you are executing an import job on a daily basis. Also avoid adding indexes unnecessarily, as they will only multiply your write operations.
Using customer_id as hash key and item_no as range key will provide the best option not only to query but also to upload your data.
As you mentioned, randomization of your customer ids would be very helpful to optimize the use of resources and prevent a possibility of a hot partition. In your case, I would follow the exact example contained in the DynamoDB documentation:
[...] One way to increase the write throughput of this application
would be to randomize the writes across multiple partition key values.
Choose a random number from a fixed set (for example, 1 to 200) and
concatenate it as a suffix [...]
So when you are writing your customer information just randomly assign the suffix to your customer ids, make sure you distribute them evenly (e.g. CustomerXYZ.1, CustomerXYZ.2, ..., CustomerXYZ.200).
To read all of the items you would need to obtain all of the items for each suffix. For example, you would first issue a Query request for the partition key value CustomerXYZ.1, then another Query for CustomerXYZ.2, and so on through CustomerXYZ.200. Because you know the suffix range (in this case 1 to 200), you only need to query the records by appending each suffix to the customer id.
Each query by the hash key CustomerXYZ.n should return a set of items (specified by the range key) from that specific customer, your application would need to merge the results from all of the Query requests.
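The suffix scheme above can be sketched as follows. The shard count of 200 and the "CustomerXYZ.n" format come from the text; the function names and the query_partition callable (standing in for one Query request per partition key) are assumptions.

```python
import random

SHARDS = 200  # fixed suffix range 1..200, as in the example above


def sharded_key(customer_id, rng=random):
    """Pick a random suffix when writing, e.g. 'CustomerXYZ.57'."""
    return "%s.%d" % (customer_id, rng.randint(1, SHARDS))


def all_shard_keys(customer_id):
    """Every partition key value that may hold items for this customer."""
    return ["%s.%d" % (customer_id, n) for n in range(1, SHARDS + 1)]


def read_all(customer_id, query_partition):
    """Issue one query per suffix and merge the results in the application."""
    items = []
    for key in all_shard_keys(customer_id):
        items.extend(query_partition(key))
    return items
```

One trade-off to keep in mind: with a random suffix, looking up a single item by item_no also needs up to 200 queries unless the suffix is derived deterministically from the item (for example from a hash of item_no), which is a variation not covered by the quoted text.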
This will certainly make reading the records harder (in terms of the additional requests needed); however, the benefits of optimized throughput and performance will pay off. Remember that a hot partition will not only increase your overall financial cost, but will also drastically impact your performance.
If you have a well designed partition key your queries will always return very quickly with minimum cost.
Additionally, make sure your import job does not execute write operations grouped by customer, for example, instead of writing all items from a specific customer in series, sort the write operations so they are distributed across all customers. Even though your customers will be distributed by several partitions (due to the id randomization process), you are better off taking this additional safety measure to prevent a burst of write activity in a single partition. More details below:
From the 'Distribute Write Activity During Data Upload' section of the official DynamoDB documentation:
To fully utilize all of the throughput capacity that has been
provisioned for your tables, you need to distribute your workload
across your partition key values. In this case, by directing an uneven
amount of upload work toward items all with the same partition key
value, you may not be able to fully utilize all of the resources
DynamoDB has provisioned for your table. You can distribute your
upload work by uploading one item from each partition key value first.
Then you repeat the pattern for the next set of sort key values for
all the items until you upload all the data [...]
Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
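One way to apply the upload advice quoted above is to interleave the import so that consecutive writes go to different customers. This is a small sketch with a hypothetical helper name; it assumes the import data is already grouped per customer in a dict.

```python
from itertools import zip_longest


def interleave_by_customer(items_by_customer):
    """Order writes round-robin: one item per customer, then the next round."""
    skip = object()  # marker for customers that have run out of items
    ordered = []
    for batch in zip_longest(*items_by_customer.values(), fillvalue=skip):
        ordered.extend(item for item in batch if item is not skip)
    return ordered
```

For example, three customers with 3, 1 and 2 items are written as a1, b1, c1, a2, c2, a3 instead of all of customer A first.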
I hope that helps. Regards.

Indexing notifications table in DynamoDB

I am going to implement a notification system, and I am trying to figure out a good way to store notifications within a database. I have a web application that uses a PostgreSQL database, but a relational database does not seem ideal for this use case; I want to support various types of notifications, each including different data, though a subset of the data is common for all types of notifications. Therefore I was thinking that a NoSQL database is probably better than trying to normalize a schema in a relational database, as this would be quite tricky.
My application is hosted in Amazon Web Services (AWS), and I have been looking a bit at DynamoDB for storing the notifications. This is because it is managed, so I do not have to deal with the operations of it. Ideally, I'd like to have used MongoDB, but I'd really prefer not having to deal with the operations of the database myself. I have been trying to come up with a way to do what I want in DynamoDB, but I have been struggling, and therefore I have a few questions.
Suppose that I want to store the following data for each notification:
An ID
User ID of the receiver of the notification
Notification type
Timestamp
Whether or not it has been read/seen
Meta data about the notification/event (no querying necessary for this)
Now, I would like to be able to query for the most recent X notifications for a given user. Also, in another query, I'd like to fetch the number of unread notifications for a particular user. I am trying to figure out a way that I can index my table to be able to do this efficiently.
I can rule out simply having a hash primary key, as I would not be doing lookups by simply a hash key. I don't know if a "hash and range primary key" would help me here, as I don't know which attribute to put as the range key. Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key? Then perhaps a secondary index could help me to sort by the timestamp, if this is even possible.
I also looked at global secondary indexes, but the problem with these is that when querying the index, DynamoDB can only return attributes that are projected into the index - and since I would want all attributes to be returned, I would effectively have to duplicate all of my data, which seems rather ridiculous.
How can I index my notifications table to support my use case? Is it even possible, or do you have any other recommendations?

Motivation Note: When using a Cloud Storage like DynamoDB we have to be aware of the Storage Model because that will directly impact
your performance, scalability, and financial costs. It is different
than working with a local database because you pay not only for the
data that you store but also the operations that you perform against
the data. Deleting a record is a WRITE operation for example, so if
you don't have an efficient plan for clean up (and your case being
Time Series Data specially needs one), you will pay the price. Your
Data Model will not show problems when dealing with small data volume
but can definitely ruin your plans when you need to scale. That being
said, decisions like creating (or not) an index, defining proper
attributes for your keys, creating table segmentation, and etc will
make the entire difference down the road. Choosing DynamoDB (or more
generically speaking, a Key-Value store) as any other architectural
decision comes with a trade-off, you need to clearly understand
certain concepts about the Storage Model to be able to use the tool
efficiently, choosing the right keys is indeed important but only the
tip of the iceberg. For example, if you overlook the fact that you are
dealing with Time Series Data, no matter what primary keys or index
you define, your provisioned throughput will not be optimized because
it is spread throughout your entire table (and its partitions) and NOT
ONLY THE DATA THAT IS FREQUENTLY ACCESSED, meaning that unused data is
directly impacting your throughput just because it is part of the same
table. This leads to cases where the
ProvisionedThroughputExceededException is thrown "unexpectedly" when
you know for sure that your provisioned throughput should be enough for your
demand, however, the TABLE PARTITION that is being unevenly accessed
has reached its limits (more details here).
The post below has more details, but I wanted to give you some motivation to read through it and understand that although you can certainly find an easier solution for now, it might mean starting from scratch in the near future when you hit a wall (the "wall" might come as high financial costs, limitations on performance and scalability, or a combination of all).
Q: Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key?
A: DynamoDB is a Key-Value storage meaning that the most efficient queries use the entire Key (Hash or Hash-Range). Using the Scan operation to actually perform a query just because you don't have your Key is definitely a sign of deficiency in your Data Model in regards to your requirements. There are a few things to consider and many options to avoid this problem (more details below).
Now, before moving on, I would suggest reading this quick post to clearly understand the difference between Hash Key and Hash+Range Key:
DynamoDB: When to use what PK type?
Your case is a typical Time Series Data scenario where your records become obsolete as the time goes by. There are two main factors you need to be careful about:
Make sure your tables have even access patterns
If you put all your notifications in a single table and the most recent ones are accessed more frequently, your provisioned throughput will not be used efficiently.
You should group the most accessed items in a single table so the provisioned throughput can be properly adjusted for the required access. Additionally, make sure you properly define a Hash Key that will allow even distribution of your data across multiple partitions.
Make sure obsolete data is deleted in the most efficient way (effort-, performance- and cost-wise)
The documentation suggests segmenting the data in different tables so you can delete or backup the entire table once the records become obsolete (see more details below).
Here is the section from the documentation that explains best practices related to Time Series Data:
Understand Access Patterns for Time Series Data
For each table that you create, you specify the throughput
requirements. DynamoDB allocates and reserves resources to handle your
throughput requirements with sustained low latency. When you design
your application and tables, you should consider your application's
access pattern to make the most efficient use of your table's
resources.
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with hash and
range type primary key with Customer ID as the hash attribute and
date/time as the range attribute. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
For example, you could have your tables segmented by month:
Notifications_April, Notifications_May, etc
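A tiny sketch of how the application might pick the table for a given notification, following the naming above. The helper name is hypothetical, and a real scheme would probably also include the year so that tables from different years do not collide.

```python
from datetime import datetime

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def notifications_table(ts):
    """Map a timestamp to its monthly table, e.g. Notifications_April."""
    return "Notifications_%s" % MONTHS[ts.month - 1]
```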
Q: I would like to be able to query for the most recent X notifications for a given user.
A: I would suggest using the Query operation and querying using only the Hash Key (UserId) having the Range Key to sort the notifications by the Timestamp (Date and Time).
Hash Key: UserId
Range Key: Timestamp
Note: A better solution would be for the Hash Key to contain not only the UserId but also another piece of information, concatenated to it, that you can calculate before querying, so that your Hash Key gives you even access patterns to your data. For example, you can start to have hot partitions if notifications from specific users are accessed more than others; having additional information in the Hash Key would mitigate this risk.
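For the simple UserId / Timestamp layout, the "most recent X" query could look like the sketch below. It only builds the request parameters; the function name and the default table name are assumptions, and UserId is assumed to be stored as a string.

```python
def latest_notifications_params(user_id, limit, table_name="Notifications_April"):
    """Build Query parameters for a user's most recent `limit` notifications."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "UserId = :u",
        "ExpressionAttributeValues": {":u": {"S": user_id}},
        "ScanIndexForward": False,  # descending by the Timestamp range key
        "Limit": limit,
    }
```

Only the hash key appears in the key condition; the range key is used purely for ordering, so no filter on Timestamp is needed.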
Q: I'd like to fetch the number of unread notifications for a particular user.
A: Create a Global Secondary Index as a Sparse Index having the UserId as the Hash Key and Unread as the Range Key.
Example:
Index Name: Notifications_April_Unread
Hash Key: UserId
Range Key: Unread
When you query this index by Hash Key (UserId) you would automatically have all unread notifications with no unnecessary scans through notifications which are not relevant to this case. Keep in mind that the original Primary Key from the table is automatically projected into the index, so in case you need to get more information about the notification you can always resort to those attributes to perform a GetItem or BatchGetItem on the original table.
Note: You can explore the idea of using different attributes other than the 'Unread' flag, the important thing is to keep in mind that a Sparse Index can help you on this Use Case (more details below).
Detailed Explanation:
I would have a sparse index to make sure that you can query a reduced dataset to do the count. In your case you can have an attribute "unread" to flag if the notification was read or not, and use that attribute to create the Sparse Index. When the user reads the notification you simply remove that attribute from the notification so it doesn't show up in the index anymore. Here are some guidelines from the documentation that clearly apply to your scenario:
Take Advantage of Sparse Indexes
For any item in a table, DynamoDB will only write a corresponding
index entry if the index range key
attribute value is present in the item. If the range key attribute
does not appear in every table item, the index is said to be sparse.
[...]
To track open orders, you can create an index on CustomerId (hash) and
IsOpen (range). Only those orders in the table with IsOpen defined
will appear in the index. Your application can then quickly and
efficiently find the orders that are still open by querying the index.
If you had thousands of orders, for example, but only a small number
that are open, the application can query the index and return the
OrderId of each open order. Your application will perform
significantly fewer reads than it would take to scan the entire
CustomerOrders table. [...]
Instead of writing an arbitrary value into the IsOpen attribute, you
can use a different attribute that will result in a useful sort order
in the index. To do this, you can create an OrderOpenDate attribute
and set it to the date on which the order was placed (and still delete
the attribute once the order is fulfilled), and create the OpenOrders
index with the schema CustomerId (hash) and OrderOpenDate (range).
This way when you query your index, the items will be returned in a
more useful sort order.[...]
Such a query can be very efficient, because the number of items in the
index will be significantly fewer than the number of items in the
table. In addition, the fewer table attributes you project into the
index, the fewer read capacity units you will consume from the index.
Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForGSI.html#GuidelinesForGSI.SparseIndexes
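A sketch of the two requests the sparse index implies: removing the Unread attribute when a notification is read, and counting what is left in the index. It assumes the table is keyed by UserId (string) and Timestamp (number) as suggested earlier; the function names are hypothetical and nothing here calls AWS.

```python
TABLE = "Notifications_April"
UNREAD_INDEX = "Notifications_April_Unread"


def mark_read_params(user_id, timestamp, table_name=TABLE):
    """UpdateItem parameters that drop the item from the sparse index."""
    return {
        "TableName": table_name,
        "Key": {"UserId": {"S": user_id}, "Timestamp": {"N": str(timestamp)}},
        # Removing the attribute removes the item's entry from the index.
        "UpdateExpression": "REMOVE #unread",
        "ExpressionAttributeNames": {"#unread": "Unread"},
    }


def unread_count_params(user_id, table_name=TABLE, index_name=UNREAD_INDEX):
    """Query parameters that count a user's unread notifications."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "UserId = :u",
        "ExpressionAttributeValues": {":u": {"S": user_id}},
        "Select": "COUNT",  # return only the count, not the items
    }
```

The count in each response covers a single page, so if a response carries a LastEvaluatedKey the per-page counts have to be summed across pages.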
Find below some references to the operations that you will need to programmatically create and delete tables:
Create Table
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_CreateTable.html
Delete Table
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DeleteTable.html

I'm an active user of DynamoDB and here is what I would do... Firstly, I'm assuming that you need to access notifications individually (e.g. to mark them as read/seen), in addition to getting the latest notifications by user_id.
Table design:
NotificationsTable
id - Hash key
user_id
timestamp
...
UserNotificationsIndex (Global Secondary Index)
user_id - Hash key
timestamp - Range key
id
When you query the UserNotificationsIndex, you set the user_id of the user whose notifications you want and ScanIndexForward to false, and DynamoDB will return the notification ids for that user in reverse chronological order. You can optionally set a limit on how many results you want returned, or get a max of 1 MB.
With regards to projecting attributes, you'll either have to project the attributes you need into the index, or you can simply project the id and then write "hydrate" functionality in your code that does a look up on each ID and returns the specific fields that you need.
If you really don't like that, here is an alternate solution for you... Set your id as your timestamp. For example, I would use the # of milliseconds since a custom epoch (e.g. Jan 1, 2015). Here is an alternate table design:
NotificationsTable
user_id - Hash key
id/timestamp - Range key
Now you can query the NotificationsTable directly, setting the user_id appropriately and setting ScanIndexForward to false on the sort of the Range key. Of course, this assumes that you won't have a collision where a user gets 2 notifications in the same millisecond. This should be unlikely, but I don't know the scale of your system.
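The timestamp-as-id idea can be sketched like this, using the Jan 1, 2015 custom epoch from the example above. The function name is an assumption, and the collision caveat just mentioned still applies.

```python
from datetime import datetime, timedelta, timezone

CUSTOM_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)


def notification_id(now=None):
    """Milliseconds since the custom epoch, usable as the numeric range key."""
    now = now or datetime.now(timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (now - CUSTOM_EPOCH) // timedelta(milliseconds=1)
```

Because the value grows with time, querying with ScanIndexForward set to false returns a user's notifications newest first.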
