Cost of adding a Global Secondary Index to an existing DynamoDB table

There is an existing DynamoDB table with a few billion records, and I want to add a Global Secondary Index (GSI) to it. Does this consume Read Request Units (RRUs) from the table's capacity, or Write Request Units (WRUs) for the index? I would like to understand any costs involved in this operation.
I am aware of the storage costs involved with respect to GSI (post creation).

Enabling a GSI won't cause more RRUs to be consumed on the base table. It uses a similar mechanism to DynamoDB streams for the replication to the underlying index table.
The GSI will consume its own Write Request Units when changes are written to it. Reading from the GSI also causes RRUs for that index to be consumed.
For cost estimation, you can focus on the number and size of items replicated to the index and the frequency of changes.

Adding a GSI won't use any RCUs. It only uses WRUs, and the number of WRUs depends on the size of the data and the projected attributes. Each write of up to 1 KB uses 1 WRU. Here is the official AWS documentation: Using Global Secondary Indexes
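As a rough way to size this, the sketch below applies the 1 KB per write unit rule mentioned above to a list of projected item sizes. The function name and the sample sizes are made up for illustration, and real index entries carry key attributes and some overhead, so treat the result as a ballpark only.

```python
import math

def estimate_index_write_units(projected_item_sizes_bytes):
    """Estimate the write units needed to copy items into a new index.

    Assumes one write unit per 1 KB (1024 bytes) of projected item data,
    rounded up per item, with a minimum of one unit per item.
    """
    return sum(
        max(1, math.ceil(size / 1024)) for size in projected_item_sizes_bytes
    )

# Hypothetical sample: three items of 300 B, 1 KB, and 2.5 KB once projected.
sizes = [300, 1024, 2560]
print(estimate_index_write_units(sizes))  # 1 + 1 + 3 = 5
```

Projecting fewer attributes shrinks each entry, which is why the projection choice feeds directly into this estimate.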

Related

Do projection fields improve query performance in DynamoDB?

I use DynamoDB to store user data. Each user has many fields, such as age, gender, first/last name, address, etc. I need to support a query API that returns only the first, last, and middle name, without the other fields.
To get better performance, I am considering two solutions:
Create a GSI which only includes those query fields. It will make each row very small.
Query the table with projection fields parameter including those query fields.
The item size is 1KB with 20 attributes. 1MB is the maximum data returned from one query. So I should receive 1024 items from querying the main index. If I use field projection to reduce the number of fields, will it give me more items in the response?
Given that DynamoDB returns at most 1 MB of data per query, which solution is better for me to use?
What you are trying to achieve is called "Sparse indexes".
Without knowing the table's traffic pattern and historical data volume, it is hard to say which is better. Another consideration is the number of RCUs (read capacity units) used for the operation.
FilterExpression is applied after a Query finishes, but before the results are returned.
Link to Documentation
With that in mind, the amount of RCU used by the FilterExpression solution will grow based on the number of fields/data the item has.
You are increasing your costs over time and need to worry about the item size and amount of fields it has.
A review of how RCU works:
DynamoDB read requests can be either strongly consistent, eventually consistent, or transactional.
A strongly consistent read request of an item up to 4 KB requires one read request unit.
An eventually consistent read request of an item up to 4 KB requires one-half read request unit.
A transactional read request of an item up to 4 KB requires two read request units.
Link to documentation
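To make the rules above concrete, here is a minimal sketch that computes read request units for a single item read. The function and dictionary names are hypothetical. It assumes items larger than 4 KB are rounded up to the next 4 KB boundary, and it models a single-item read rather than a Query, where sizes are summed before rounding.

```python
import math

# Units charged per 4 KB chunk, taken from the rules quoted above.
UNITS_PER_4KB = {
    "strong": 1.0,
    "eventual": 0.5,
    "transactional": 2.0,
}

def read_request_units(item_size_bytes, consistency="eventual"):
    """Return the read request units for reading one item."""
    chunks = max(1, math.ceil(item_size_bytes / 4096))
    return chunks * UNITS_PER_4KB[consistency]

print(read_request_units(1024, "eventual"))       # 0.5
print(read_request_units(1024, "strong"))         # 1.0
print(read_request_units(9000, "transactional"))  # 3 chunks * 2 = 6.0
```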
You can use a GSI to get separate throughput and control the RCU capacity used. The amount of data transferred becomes predictable, because RCU utilization is based on the index entries only (first, middle, and last name).
You will need to update your application to use the new index and work with eventually consistent reads. GSIs do not support strongly consistent reads.
Global secondary indexes support eventually consistent reads, each of which consume one half of a read capacity unit. This means that a single global secondary index query can retrieve up to 2 × 4 KB = 8 KB per read capacity unit.
For global secondary index queries, DynamoDB calculates the provisioned read activity in the same way as it does for queries against tables. The only difference is that the calculation is based on the sizes of the index entries, rather than the size of the item in the base table.
Link to documentation
Returning to your question: "which solution is better for me to use?"
If you need strongly consistent reads, you have to query the base table with a FilterExpression. Otherwise, use a GSI.
A good read on this topic is the article: When to use (and when not to use) DynamoDB Filter Expressions
First of all, it's important to note that DynamoDB's 1 MB limit is not a blocker; it's there for performance reasons.
You seem to be trying to shrink your payload below the 1 MB limit, which is unnecessary. Instead, you should just introduce pagination.
DynamoDB paginates the results from Query operations. With pagination, the Query results are divided into "pages" of data that are 1 MB in size (or less). An application can process the first page of results, then the second page, and so on.
The LastEvaluatedKey from a Query response should be used as the ExclusiveStartKey for the next Query request. If there is not a LastEvaluatedKey element in a Query response, then you have retrieved the final page of results. If LastEvaluatedKey is not empty, it does not necessarily mean that there is more data in the result set. The only way to know when you have reached the end of the result set is when LastEvaluatedKey is empty.
Ref
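The pagination loop described above can be sketched as follows. This is an illustration, not an official client: query_all_pages and fake_query are hypothetical names, and query_fn stands for any callable that accepts Query parameters as keyword arguments and returns a dict shaped like a Query response.

```python
def query_all_pages(query_fn, **query_kwargs):
    """Collect the items from every page of a Query-style call.

    Keeps calling query_fn, feeding each LastEvaluatedKey back in as
    ExclusiveStartKey, until a response has no LastEvaluatedKey.
    """
    items = []
    kwargs = dict(query_kwargs)
    while True:
        page = query_fn(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        kwargs["ExclusiveStartKey"] = last_key

# Stand-in for a real client call, used only to make the sketch runnable.
def fake_query(**kwargs):
    pages = {
        None: {"Items": [1, 2], "LastEvaluatedKey": {"pk": "a"}},
        "a": {"Items": [3], "LastEvaluatedKey": {"pk": "b"}},
        "b": {"Items": []},  # final page: empty and no LastEvaluatedKey
    }
    start = kwargs.get("ExclusiveStartKey")
    return pages[start["pk"] if start else None]

print(query_all_pages(fake_query))  # [1, 2, 3]
```

With boto3, the query method of a Table resource could be passed as query_fn together with keyword arguments such as KeyConditionExpression and ProjectionExpression. Note how the stand-in's last page is empty even though the previous page carried a LastEvaluatedKey, which matches the caveat quoted above.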
GSI or ProjectionExpression
This ultimately depends on what you need. For example, if you just want certain attributes and the base table's keys suit your access patterns, then I would 100% use a ProjectionExpression and paginate the results until I have all the data.
You should only create a GSI if the keys of the base table do not suit your access patterns. A GSI will increase your table costs, and you will be storing more data and consuming extra throughput when your use case doesn't need it.

What happens when a partition in a DynamoDB table with a local secondary index exceeds capacity?

I'm having a bit of trouble reconciling documentation on how DynamoDB behaves when a partition exceeds capacity of 10GB when there is a local secondary index.
In this article about how adaptive capacity makes many DynamoDB best practices / constraints obsolete, it is said that
DynamoDB splits partitions by sort key if the collection size grows bigger than 10 GB.
But, in this article about adaptive capacity
Adaptive capacity will not split item collections across multiple partitions of the table when there is a local secondary index on the table.
I know the information in these two articles isn't mutually exclusive, but it seems weird that the first article wouldn't at least mention the tremendous downside of having a local secondary index if that were the case.
So, what exactly happens when a partition exceeds capacity with a local secondary index? Does it just throttle the DynamoDB table since it can't split the partition?
First of all, please feel free to provide feedback directly on any documentation page that you feel is missing information. That will cut a ticket internally to the team responsible.
To answer your question: when an item collection on a table with an LSI grows to 10 GB, you will begin to receive ItemCollectionSizeLimitExceededException, which prevents you from adding further data to that item collection. You will also be blocked from updating items if the update would increase the item's size. You will still be allowed to delete and read items.
More information can be found here:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LSI.html#LSI.ItemCollections.SizeLimit

Assigning the same partition key to all items in a DynamoDB table

I need to be able to run some range-based queries on my DynamoDB table, such as int_attribute > 5, or starts_with(string_attribute, "foo"). These can all be answered by creating a global or local secondary index and then submitting a Query to these indexes. However, running a Query requires that you also provide a single value of the partition key to restrict the query set. Neither of these queries has a strict equality condition, so I am considering giving all the items in my DynamoDB table the same partition key and distinguishing them only by the sort key. My dataset is well within the 10 GB partition size limit.
Are there any catastrophic issues that might occur if I do this?
Yes, you can create a GSI where every item goes under the same partition key. The thing to be aware of is you'll generally be putting all those writes into the same physical partition, each of which has a max update rate of 1,000 WCU.
If your update rate is below that, proceed. If your update rate is above that, you'll want to follow a pattern of sharding the GSI partition key value so it spreads across more partitions.
Say you require 10,000 WCU for the GSI. You can assign each item's GSI PK value to a random value-{x} where x is 0 to 9. Then yes, at query time you do 10 queries and put the results back together yourself. This approach can scale as large as you need.
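The sharding pattern described above can be sketched as follows. All names here are hypothetical, and query_shard stands for whatever function runs a Query against the GSI for one partition key value and returns a list of item dicts.

```python
import random

SHARD_COUNT = 10  # assumed: 10 shards, as in the 10,000 WCU example above

def sharded_gsi_pk(base_value, shard_count=SHARD_COUNT):
    """Pick a random shard suffix for an item's GSI partition key."""
    return f"{base_value}-{random.randrange(shard_count)}"

def all_shard_keys(base_value, shard_count=SHARD_COUNT):
    """List every partition key value that must be queried at read time."""
    return [f"{base_value}-{x}" for x in range(shard_count)]

def query_all_shards(query_shard, base_value, sort_key, shard_count=SHARD_COUNT):
    """Run one query per shard and merge the results by sort key."""
    merged = []
    for pk in all_shard_keys(base_value, shard_count):
        merged.extend(query_shard(pk))
    return sorted(merged, key=lambda item: item[sort_key])

# Stand-in data keyed by shard, used only to make the sketch runnable.
fake_index = {"value-0": [{"sk": 3}], "value-7": [{"sk": 1}, {"sk": 2}]}
print(query_all_shards(lambda pk: fake_index.get(pk, []), "value", "sk"))
# [{'sk': 1}, {'sk': 2}, {'sk': 3}]
```

The per-shard queries are independent, so they could also be issued in parallel before merging.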

Can DynamoDB reads on GSI with badly chosen partition key affect the read/write for the table

I have a DynamoDB table with a good partition key (PK=uuid) but there is a GSI (PK=type, SK=created) where type has only 6 unique values and created is epoch time.
My question is if I start to do a lot of reads with this GSI, will that affect the performance of the whole table? I see that the read capacity for both the table and the GSI is not shared according to this AWS docs but what will happen behind the scene if we start to use this GSI a lot? Will Dynamodb writes get impacted?
A global secondary index has separate capacity units, to avoid impacting performance on the table itself.
A global secondary index is stored in its own partition space away from the base table and scales separately from the base table.
Performance on your table can only be impacted if its own credits are depleted. The global secondary index sits in its own partition space, which can be treated as if it has its own boundaries.
In addition, as DynamoDB uses separate credits for reads (RCU) and writes (WCU), these two actions would never have a performance impact on each other.

Indexing notifications table in DynamoDB

I am going to implement a notification system, and I am trying to figure out a good way to store notifications within a database. I have a web application that uses a PostgreSQL database, but a relational database does not seem ideal for this use case; I want to support various types of notifications, each including different data, though a subset of the data is common for all types of notifications. Therefore I was thinking that a NoSQL database is probably better than trying to normalize a schema in a relational database, as this would be quite tricky.
My application is hosted in Amazon Web Services (AWS), and I have been looking a bit at DynamoDB for storing the notifications. This is because it is managed, so I do not have to deal with the operations of it. Ideally, I'd like to have used MongoDB, but I'd really prefer not having to deal with the operations of the database myself. I have been trying to come up with a way to do what I want in DynamoDB, but I have been struggling, and therefore I have a few questions.
Suppose that I want to store the following data for each notification:
An ID
User ID of the receiver of the notification
Notification type
Timestamp
Whether or not it has been read/seen
Meta data about the notification/event (no querying necessary for this)
Now, I would like to be able to query for the most recent X notifications for a given user. Also, in another query, I'd like to fetch the number of unread notifications for a particular user. I am trying to figure out a way that I can index my table to be able to do this efficiently.
I can rule out simply having a hash primary key, as I would not be doing lookups by simply a hash key. I don't know if a "hash and range primary key" would help me here, as I don't know which attribute to put as the range key. Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key? Then perhaps a secondary index could help me to sort by the timestamp, if this is even possible.
I also looked at global secondary indexes, but the problem with these is that when querying the index, DynamoDB can only return attributes that are projected into the index. Since I would want all attributes to be returned, I would effectively have to duplicate all of my data, which seems rather ridiculous.
How can I index my notifications table to support my use case? Is it even possible, or do you have any other recommendations?
Motivation Note: When using a cloud storage service like DynamoDB, we have to be aware of the storage model because it directly impacts performance, scalability, and financial costs. It is different from working with a local database because you pay not only for the data that you store but also for the operations that you perform against that data. Deleting a record is a WRITE operation, for example, so if you don't have an efficient plan for cleanup (and your case, being Time Series Data, especially needs one), you will pay the price. Your data model will not show problems when dealing with a small data volume, but it can definitely ruin your plans when you need to scale. That being said, decisions like creating (or not creating) an index, defining proper attributes for your keys, and segmenting tables will make all the difference down the road.
Choosing DynamoDB (or, more generically, a key-value store), like any other architectural decision, comes with trade-offs. You need to clearly understand certain concepts about the storage model to use the tool efficiently; choosing the right keys is indeed important, but it is only the tip of the iceberg. For example, if you overlook the fact that you are dealing with Time Series Data, then no matter what primary keys or indexes you define, your provisioned throughput will not be optimized, because it is spread throughout your entire table (and its partitions) and NOT ONLY THE DATA THAT IS FREQUENTLY ACCESSED. Unused data directly impacts your throughput just because it is part of the same table. This leads to cases where ProvisionedThroughputExceededException is thrown "unexpectedly": you know for sure that your provisioned throughput should be enough for your demand, but the TABLE PARTITION that is being unevenly accessed has reached its limits (more details here).
The post below has more details, but I wanted to give you some motivation to read through it and understand that although you can certainly find an easier solution for now, it might mean starting from scratch in the near future when you hit a wall (the "wall" might come as high financial costs, limitations on performance and scalability, or a combination of all of these).
Q: Could I have a unique notification ID as the hash key and the user ID as the range key? Would that allow me to do lookups only by the range key, i.e. without providing the hash key?
A: DynamoDB is a key-value store, meaning that the most efficient queries use the entire key (Hash or Hash-Range). Using the Scan operation to perform a query just because you don't have your key is a sign that your data model does not fit your requirements. There are a few things to consider and many options to avoid this problem (more details below).
Before moving on, I suggest reading this quick post to clearly understand the difference between a Hash Key and a Hash+Range Key:
DynamoDB: When to use what PK type?
Your case is a typical Time Series Data scenario where your records become obsolete as the time goes by. There are two main factors you need to be careful about:
Make sure your tables have even access patterns
If you put all your notifications in a single table and the most recent ones are accessed more frequently, your provisioned throughput will not be used efficiently.
You should group the most accessed items in a single table so the provisioned throughput can be properly adjusted for the required access. Additionally, make sure you properly define a Hash Key that will allow even distribution of your data across multiple partitions.
Make sure obsolete data is deleted in the most efficient way (in terms of effort, performance, and cost)
The documentation suggests segmenting the data in different tables so you can delete or backup the entire table once the records become obsolete (see more details below).
Here is the section from the documentation that explains best practices related to Time Series Data:
Understand Access Patterns for Time Series Data
For each table that you create, you specify the throughput
requirements. DynamoDB allocates and reserves resources to handle your
throughput requirements with sustained low latency. When you design
your application and tables, you should consider your application's
access pattern to make the most efficient use of your table's
resources.
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with hash and
range type primary key with Customer ID as the hash attribute and
date/time as the range attribute. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
For example, you could have your tables segmented by month:
Notifications_April, Notifications_May, etc
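A minimal sketch of how an application might pick the monthly table for a given notification timestamp. The helper name is hypothetical, and the naming scheme simply follows the Notifications_April example above.

```python
from datetime import datetime, timezone

# Spelled out explicitly so the result does not depend on the system locale.
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def notifications_table_for(moment):
    """Return the name of the monthly table that holds this timestamp."""
    return f"Notifications_{MONTH_NAMES[moment.month - 1]}"

print(notifications_table_for(datetime(2015, 4, 20, tzinfo=timezone.utc)))
# Notifications_April
```

A scheme that keeps tables for more than a year would also need the year in the name, which this sketch leaves out.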
Q: I would like to be able to query for the most recent X notifications for a given user.
A: I would suggest using the Query operation with only the Hash Key (UserId), and letting the Range Key sort the notifications by Timestamp (date and time).
Hash Key: UserId
Range Key: Timestamp
Note: A better solution would be for the Hash Key to contain not only the UserId but also another piece of information, concatenated to it, that you can calculate before querying, so that your Hash Key gives you even access patterns to your data. For example, you can start to get hot partitions if notifications from specific users are accessed more than others. Having additional information in the Hash Key would mitigate this risk.
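As an illustration of the key design above, this sketch builds the parameters for a Query that returns the most recent X notifications for one user. The function name is hypothetical, the dict uses the low-level attribute-value format, and it assumes UserId is stored as a string.

```python
def recent_notifications_query(table_name, user_id, limit):
    """Build Query parameters for a user's newest notifications."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "UserId = :uid",
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
        "ScanIndexForward": False,  # descending Range Key order: newest first
        "Limit": limit,             # stop after the most recent X items
    }

params = recent_notifications_query("Notifications_April", "user-123", 20)
print(params["ScanIndexForward"], params["Limit"])  # False 20
```

A dict like this could be passed to a low-level client as keyword arguments, for example client.query(**params) with boto3.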
Q: I'd like to fetch the number of unread notifications for a particular user.
A: Create a Global Secondary Index as a Sparse Index having the UserId as the Hash Key and Unread as the Range Key.
Example:
Index Name: Notifications_April_Unread
Hash Key: UserId
Range Key: Unread
When you query this index by Hash Key (UserId) you would automatically have all unread notifications with no unnecessary scans through notifications which are not relevant to this case. Keep in mind that the original Primary Key from the table is automatically projected into the index, so in case you need to get more information about the notification you can always resort to those attributes to perform a GetItem or BatchGetItem on the original table.
Note: You can explore the idea of using different attributes other than the 'Unread' flag, the important thing is to keep in mind that a Sparse Index can help you on this Use Case (more details below).
Detailed Explanation:
I would have a sparse index to make sure that you can query a reduced dataset to do the count. In your case you can have an attribute "unread" to flag if the notification was read or not, and use that attribute to create the Sparse Index. When the user reads the notification you simply remove that attribute from the notification so it doesn't show up in the index anymore. Here are some guidelines from the documentation that clearly apply to your scenario:
Take Advantage of Sparse Indexes
For any item in a table, DynamoDB will only write a corresponding
index entry if the index range key
attribute value is present in the item. If the range key attribute
does not appear in every table item, the index is said to be sparse.
[...]
To track open orders, you can create an index on CustomerId (hash) and
IsOpen (range). Only those orders in the table with IsOpen defined
will appear in the index. Your application can then quickly and
efficiently find the orders that are still open by querying the index.
If you had thousands of orders, for example, but only a small number
that are open, the application can query the index and return the
OrderId of each open order. Your application will perform
significantly fewer reads than it would take to scan the entire
CustomerOrders table. [...]
Instead of writing an arbitrary value into the IsOpen attribute, you
can use a different attribute that will result in a useful sort order
in the index. To do this, you can create an OrderOpenDate attribute
and set it to the date on which the order was placed (and still delete
the attribute once the order is fulfilled), and create the OpenOrders
index with the schema CustomerId (hash) and OrderOpenDate (range).
This way when you query your index, the items will be returned in a
more useful sort order.[...]
Such a query can be very efficient, because the number of items in the
index will be significantly fewer than the number of items in the
table. In addition, the fewer table attributes you project into the
index, the fewer read capacity units you will consume from the index.
Source:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForGSI.html#GuidelinesForGSI.SparseIndexes
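The sparse-index approach for unread notifications can be sketched as two parameter builders: one that removes the Unread attribute when a notification is read, and one that counts the entries left in the index. The function names are hypothetical, and the key layout (UserId as a string, Timestamp as a number) is an assumption based on the table design suggested earlier in this answer.

```python
def mark_as_read_update(table_name, user_id, timestamp):
    """Build UpdateItem parameters that drop the item from the sparse index."""
    return {
        "TableName": table_name,
        "Key": {"UserId": {"S": user_id}, "Timestamp": {"N": str(timestamp)}},
        # Removing the index's range key attribute removes the index entry.
        "UpdateExpression": "REMOVE #unread",
        "ExpressionAttributeNames": {"#unread": "Unread"},
    }

def unread_count_query(table_name, index_name, user_id):
    """Build Query parameters that count a user's unread notifications."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "UserId = :uid",
        "ExpressionAttributeValues": {":uid": {"S": user_id}},
        "Select": "COUNT",  # return only the count, not the items
    }

query = unread_count_query(
    "Notifications_April", "Notifications_April_Unread", "user-123"
)
print(query["IndexName"], query["Select"])  # Notifications_April_Unread COUNT
```

A COUNT query is still paginated, so for a user with many unread notifications the Count values from each page would need to be summed.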
Find below some references to the operations that you will need to programmatically create and delete tables:
Create Table
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_CreateTable.html
Delete Table
http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DeleteTable.html
I'm an active user of DynamoDB and here is what I would do... Firstly, I'm assuming that you need to access notifications individually (e.g. to mark them as read/seen), in addition to getting the latest notifications by user_id.
Table design:
NotificationsTable
id - Hash key
user_id
timestamp
...
UserNotificationsIndex (Global Secondary Index)
user_id - Hash key
timestamp - Range key
id
When you query the UserNotificationsIndex, you set the user_id of the user whose notifications you want and ScanIndexForward to false, and DynamoDB will return the notification ids for that user in reverse chronological order. You can optionally set a limit on how many results you want returned, or get a max of 1 MB.
With regards to projecting attributes, you'll either have to project the attributes you need into the index, or you can simply project the id and then write "hydrate" functionality in your code that does a look up on each ID and returns the specific fields that you need.
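A minimal sketch of the "hydrate" step, assuming the index returned a list of notification ids and that id is a string hash key. The function name is hypothetical. It splits the ids into groups of at most 100, the per-request limit for BatchGetItem, and builds the request parameters for each group.

```python
def batch_get_requests(table_name, ids, batch_size=100):
    """Yield BatchGetItem parameter dicts, one per group of ids."""
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        yield {
            "RequestItems": {
                table_name: {"Keys": [{"id": {"S": i}} for i in chunk]}
            }
        }

ids = [str(n) for n in range(250)]
requests = list(batch_get_requests("NotificationsTable", ids))
print([len(r["RequestItems"]["NotificationsTable"]["Keys"]) for r in requests])
# [100, 100, 50]
```

A real implementation would also retry any UnprocessedKeys returned in each response.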
If you really don't like that, here is an alternate solution for you... Set your id as your timestamp. For example, I would use the # of milliseconds since a custom epoch (e.g. Jan 1, 2015). Here is an alternate table design:
NotificationsTable
user_id - Hash key
id/timestamp - Range key
Now you can query the NotificationsTable directly, setting the user_id appropriately and setting ScanIndexForward to false on the sort of the Range key. Of course, this assumes that you won't have a collision where a user gets 2 notifications in the same millisecond. This should be unlikely, but I don't know the scale of your system.
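A small sketch of generating such an id as the number of milliseconds since a custom epoch of Jan 1, 2015 (UTC). The function and constant names are hypothetical.

```python
from datetime import datetime, timezone

CUSTOM_EPOCH = datetime(2015, 1, 1, tzinfo=timezone.utc)

def notification_id(moment):
    """Return whole milliseconds elapsed between CUSTOM_EPOCH and moment."""
    delta = moment - CUSTOM_EPOCH
    # Integer arithmetic avoids floating-point rounding on large values.
    return (
        delta.days * 86_400_000
        + delta.seconds * 1000
        + delta.microseconds // 1000
    )

print(notification_id(datetime(2015, 1, 1, 0, 0, 1, tzinfo=timezone.utc)))  # 1000
print(notification_id(datetime.now(timezone.utc)))  # current id
```

Two notifications for the same user in the same millisecond would produce the same key, which is the collision risk mentioned above.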