I have a single-table design for my app. However, certain rows in the table hold important information that I plan to use to query different kinds of data. Let me explain. My app handles alarms triggered by users. When an alarm gets triggered, I record a lot of info about that alarm. My goal is to create GSIs so I can retrieve and sort all the info about the alarm that was triggered. Let me give you an example of a row in my table.
Attribute       | Value
----------------|------------------------------------------------------
PK              | ShipmentReceived
SK              | AL#TR#2020-08-19T23:37:41.513Z
GSI1PK          | AL#TR
GSI1SK          | 2020-08-19T23:37:41.513Z
GSI2PK          | AL#TR#LO
GSI2SK          | Building1#WingA#Floor1#OfficeB#2020-08-19T23:37:41.513Z
GSI3PK          | user#example.com
GSI3SK          | 2020-08-19T23:37:41.513Z
GSI4PK          | 1234567
GSI4SK          | 2020-08-19T23:37:41.513Z
GSI5PK          | AL#TR#HOW
GSI5SK          | PC#KS
OtherProperties | Other values go in other columns
NOTE: AL#TR means: "Alarm Triggered" and AL#TR#LO means "Alarm triggered from location". AL#TR#HOW indicates how the alarm was triggered. 1234567 is a "device ID" used to trigger the alarm.
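As a rough sketch of how one such row could be assembled before writing it, the Python below builds the key attributes from the alarm's fields. The function names (`iso_millis`, `build_alarm_item`) and the argument layout are assumptions made for illustration; only the attribute names and prefixes come from the row above.

```python
from datetime import datetime, timezone


def iso_millis(ts: datetime) -> str:
    """Format an aware datetime as UTC with millisecond precision."""
    ts = ts.astimezone(timezone.utc)
    return ts.strftime("%Y-%m-%dT%H:%M:%S.") + f"{ts.microsecond // 1000:03d}Z"


def build_alarm_item(alarm_type, triggered_at, location_parts, user, device_id, method):
    """Return the key attributes for one alarm row (hypothetical helper)."""
    ts = iso_millis(triggered_at)
    return {
        "PK": alarm_type,                           # e.g. ShipmentReceived
        "SK": f"AL#TR#{ts}",
        "GSI1PK": "AL#TR",                          # all triggered alarms
        "GSI1SK": ts,
        "GSI2PK": "AL#TR#LO",                       # alarms by location
        "GSI2SK": "#".join([*location_parts, ts]),
        "GSI3PK": user,                             # alarms by user
        "GSI3SK": ts,
        "GSI4PK": device_id,                        # alarms by device
        "GSI4SK": ts,
        "GSI5PK": "AL#TR#HOW",                      # alarms by trigger method
        "GSI5SK": method,
    }


item = build_alarm_item(
    "ShipmentReceived",
    datetime(2020, 8, 19, 23, 37, 41, 513000, tzinfo=timezone.utc),
    ["Building1", "WingA", "Floor1", "OfficeB"],
    "user#example.com",
    "1234567",
    "PC#KS",
)
```

Keeping the timestamp in ISO 8601 form means that string ordering of the sort keys matches chronological ordering.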
This kind of structure allows me to query for all sorts of interesting data. For example:
Base table (PK/SK): I can get all of the ShipmentReceived alarms sorted by date.
GSI1: I can get all of the alarms triggered at the company and sort them by date (That includes ShipmentReceived, PackageSent, etc)
GSI2: I can get all of the alarms triggered at a certain location and I can sort them by date.
GSI3: I can get all of the alarms triggered by a specific user and I can sort by date.
GSI4: I can get all of the alarms triggered by a specific device and I can sort them by date.
GSI5: Allows me to sort the alarms by method used to trigger them.
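To make one of these access patterns concrete, here is a minimal sketch of the request parameters for the location query, in the shape expected by the low-level `Query` API (for example boto3's `client.query(**params)`). The table name `Alarms`, the index name `GSI2`, and the helper name are assumptions, not something defined above.

```python
def location_query(location_prefix, table_name="Alarms", index_name="GSI2"):
    """Build Query parameters for alarms under a location prefix.

    table_name and index_name are hypothetical; adjust to the real schema.
    """
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "GSI2PK = :pk AND begins_with(GSI2SK, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"S": "AL#TR#LO"},
            ":prefix": {"S": location_prefix},
        },
        # Descending sort-key order, i.e. newest first within one office.
        "ScanIndexForward": False,
    }


params = location_query("Building1#WingA#Floor1#OfficeB#")
```

One thing to watch with this key layout: results come back in sort-key order, so they are ordered strictly by date only when the prefix covers the full location. A shorter prefix such as `Building1#` returns rows grouped by wing, floor, and office first, and by date within each group.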
I am reading the DynamoDB documentation and I see that it says that it is not recommended to use indexes on items that are not queried often. A lot of these GSIs will not be queried often at all. Just very sporadically.
My question is: am I doing this wrong by creating 5 different GSIs in this case? Is there a better way to model this data? I thought about inserting multiple rows with related information instead of having everything in one row, but I do not know if that is a better approach. Any other ideas?
I'm on the DynamoDB team in Seattle, and this response is from one of my colleagues:
"Anytime you need to group or sort the same entities differently, you need to make a new GSI for that access pattern. When you have multiple entity types stored in the same table you can reuse the GSI (aka GSI overloading) for those access patterns on different entities. But in your case, all of the access patterns are about grouping and sorting alarm entities so each would need a different GSI.
"However, GSIs exist to speed up or make cheaper read requests with the trade-off being a higher write expense (to keep the GSIs updated). This makes sense in access patterns that have a high read:write ratio and where the response must come back quickly. But for read access patterns that are done infrequently and for which there isn't a low-latency requirement, it might be cheaper to simply do a Scan operation compared to the cost of having a GSI. For example, for a batch job that runs once a day or once a week it might be cheaper to scan the table once a day or once a week."
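The trade-off described above can be estimated with some back-of-the-envelope arithmetic. The sketch below counts capacity units only, under stated assumptions: every write touches each index and costs one write unit per 1 KB (rounded up) per GSI, and a Scan consumes one read unit per 4 KB scanned, halved for eventually consistent reads. The workload numbers are invented for illustration, and index storage cost is ignored.

```python
import math


def gsi_extra_write_units(writes_per_month, item_kb, gsi_count):
    """Extra write units spent keeping gsi_count indexes up to date."""
    units_per_write = math.ceil(item_kb)  # 1 write unit per 1 KB, rounded up
    return writes_per_month * units_per_write * gsi_count


def scan_read_units(table_kb, scans_per_month, eventually_consistent=True):
    """Read units consumed by scanning the whole table scans_per_month times."""
    units = math.ceil(table_kb / 4)  # 1 read unit per 4 KB scanned
    if eventually_consistent:
        units *= 0.5
    return units * scans_per_month


# Hypothetical workload: 1M alarm writes a month, 1 KB items, one rarely used GSI,
# versus scanning a 2 GB table once a week.
extra_writes = gsi_extra_write_units(1_000_000, 1, 1)
weekly_scans = scan_read_units(2 * 1024 * 1024, 4)
```

With these made-up numbers the two unit counts are of similar size, and since write units are generally priced higher than read units, the scan comes out cheaper. The balance shifts toward the GSI as the table grows or the query runs more often, so it is worth plugging in real figures and current pricing.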
Related
I have an application being built using AWS AppSync with a primary focus of sending telemetry data from a mobile application. I am stuck on how to partition and structure the DynamoDB tables for this as the users of the application belong to different organizations, in those organizations there will be admins who are able to view the data specific to their organization.
OrganizationA
-->Admin # View all the telemetry data
---->User # Send the telemetry data from their mobile application
Based on some research from these resources,
Link 1.
Link 2.
The advised manner is to create tables for individual periods i.e., a table for every day with the telemetry readings.
Example (not sure what pk is in this example):
The way I am planning to separate the users using AWS Cognito is by attaching custom attributes when the user signs up, such as Organization and Role (Admin or User), as per this answer, and then using a Pre-Signup Lambda Trigger.
How should I achieve this?
Since you really don't need users from one organization to read data from another organization, and for all your access patterns you will always know the organization id, then that attribute should be a factor in partitioning: either at the table level, or at the partition key level.
Then you have to determine if you can simply use the organization id as a partition key, or you need to further partition -- say, by concatenating the organization id and the hour value for each sample. This will depend on the amount of data you expect to generate by each organization in a given day. The tradeoff being more granular partitioning vs. cost of querying for data.
If organizations generate small amounts of data each day (say, a few events an hour) then just use organization id as the partition key. Otherwise, partition the data further.
In all of the above, the sort key should probably be the timestamp of the events, either with second or millisecond precision depending on your needs. That way your queries can retrieve ordered time-series data.
Keep in mind that when you make queries, you may need to execute multiple queries and stick the results together in your application to fully represent the results as the range may span multiple partitions, or even multiple tables.
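A minimal sketch of the "organization id plus hour" partitioning, including the list of partitions a time-range read has to visit, might look like the following. The key format `<org>#<YYYY-MM-DDTHH>` and the function names are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def partition_key(org_id, ts):
    """Partition key made of the organization id and the hour of the sample."""
    return f"{org_id}#{ts.strftime('%Y-%m-%dT%H')}"


def partition_keys_for_range(org_id, start, end):
    """All hourly partition keys that a query over [start, end] must cover."""
    cursor = start.replace(minute=0, second=0, microsecond=0)
    keys = []
    while cursor <= end:
        keys.append(partition_key(org_id, cursor))
        cursor += timedelta(hours=1)
    return keys


keys = partition_keys_for_range(
    "OrganizationA",
    datetime(2020, 8, 19, 22, 30, tzinfo=timezone.utc),
    datetime(2020, 8, 20, 0, 10, tzinfo=timezone.utc),
)
# keys == ["OrganizationA#2020-08-19T22",
#          "OrganizationA#2020-08-19T23",
#          "OrganizationA#2020-08-20T00"]
```

The application would then issue one Query per key, with the timestamp sort key bounded by `start` and `end`, and concatenate the results in key order. Finer buckets spread writes better but increase the number of queries per read, which is the trade-off mentioned above.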
As described in https://aws.amazon.com/blogs/database/choosing-the-right-dynamodb-partition-key/, the partition key should be unique.
I am building an application that needs to store subscriptions to a topic (think of a chat app). Millions of those subscriptions would need to be stored in the database and when ever a message should be emitted to the subscribers, the application needs to get all subscribers from the table.
Naive approach
The naive approach would be, to design a primary key like:
SUBSCRIPTIONS|<topic>
The sort key would then order all subscriptions for the <topic> by time of subscription, region, and a few other criteria.
Unfortunately, this partition key is far from unique, although it would allow fetching all subscriptions in a blink.
Also, a maximum table size would set a hard limit on the number of subscriptions that can be held in a partition, and thus on the maximum number of subscriptions overall for this design. So this design is bound to fail at scale.
Alternative
The other way of designing it would be to use something like
SUBSCRIPTIONS|<clientId>
to hold each and every subscription separately per client and move the <topic> into the sort key. This would allow the table to scale (partitioning) far better, but would need scans to find all subscribers for a certain <topic>.
An index might help here, but how does an index scale over multiple partitions? and how will it perform?
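A middle ground between the two designs that is often described for hot partition keys is write sharding: the topic key gets a small numeric suffix, so one topic is spread over a fixed number of partition keys that can each be read with a Query. This is a different technique from either design above, sketched here under assumptions: the shard count of 10, the `|<shard>` suffix format, and the function names are all hypothetical.

```python
import hashlib

SHARD_COUNT = 10  # hypothetical; tune to the expected subscribers per topic


def shard_for(client_id, shard_count=SHARD_COUNT):
    """Derive a stable shard number from the client id."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % shard_count


def subscription_key(topic, client_id):
    """Partition key under which one client's subscription is stored."""
    return f"SUBSCRIPTIONS|{topic}|{shard_for(client_id)}"


def all_partition_keys(topic, shard_count=SHARD_COUNT):
    """Every partition key to query when emitting a message to a topic."""
    return [f"SUBSCRIPTIONS|{topic}|{n}" for n in range(shard_count)]


key = subscription_key("news", "client-1")
fan_out = all_partition_keys("news")
```

Fetching all subscribers then takes `SHARD_COUNT` queries (which can run in parallel) instead of one query or a full scan.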
I'm quite new to NoSQL and DynamoDB, and I am used to RDBMS. I'm designing a database for a game, and we're using DynamoDB and AWS Lambda for our backend. I created a table named "Users" for the player profile, which contains the user information and resources. Because the game has an inventory system, I also created a table named "UserItems".
It was all good until I realized DynamoDB doesn't have transactions: any operation that is executed on both tables (for example, using an item that increases a resource) has a chance of failing on one table while succeeding on the other, which would cause missing data that affects our customers.
So I was thinking maybe my multiple-table design is not good, since designing multiple tables is a habit of mine from working with RDBMS. That led me to think of storing the entire "UserItems" as a hash in "Users", but I'm not sure this is good practice, because the size of a single row in the Users table will be really big (we may have 500 unique items per user), and each time I pull or put data from/to "Users" (most of the time I don't need the "UserItems" data) the read/write throughput will also be really large.
What should I do, keep the multiple tables design and handle transaction manually or switch to single table design? Or maybe there is a 3rd option?
Updated: more information about my use case
Currently I have 2 tables
Users: UserId (key), Username, Gold
UserItems: UserId (partition key), ItemId (sort key), Name, GoldValue
Scenarios:
User buys an item: Users.Gold will be deducted, and a new UserItem will be added to the UserItems table.
User sells an item: Users.Gold will be increased, and the item will be deleted from the UserItems table.
In both scenarios above I have to do 2 update operations on 2 tables, and without transactions there is a chance that one of them fails.
To solve that, I am considering a single-table solution: a single Users table with 4 columns, UserId (key), Username, Gold, and UserItems. However, there are two things I'm worried about:
1. Data in UserItems might become too big for a single cell, because one user could have up to 500 items.
2. To add/delete an item I have to pull the UserItems from DynamoDB, add/delete the item, and then put it back into Users. So I have to do 1 read and 1 write operation for 1 action. And because of issue (1), the read/write data size could become really big.
FWIW, the AWS documentation on NoSQL Design for DynamoDB suggests to use a single table:
As a general rule, you should maintain as few tables as possible in a
DynamoDB application. As emphasized earlier, most well designed
applications require only one table, unless there is a specific reason
for using multiple tables.
Exceptions are cases where high-volume time series data are involved,
or datasets that have very different access patterns—but these are
exceptions. A single table with inverted indexes can usually enable
simple queries to create and retrieve the complex hierarchical data
structures required by your application.
A NoSQL database is best suited for non-transactional data. If you bring normalization (splitting your data into multiple tables) into NoSQL, then you are defeating the whole purpose of it. If performance is what matters most, then you should consider having only a single table for your use case. DynamoDB supports range keys and also supports secondary indexes. For your use case, it would be better to redesign your table to use range keys.
If you can share more details about your current table, maybe I can help you with more input.
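Since this thread was written, DynamoDB has added a transactional write API (`TransactWriteItems`), so the two-table layout from the question can be updated atomically. Below is a minimal sketch of the request for the "user buys an item" scenario, in the low-level attribute-value format. The table and attribute names come from the question; the helper name, the string key types, and the two condition expressions are assumptions.

```python
def buy_item_transaction(user_id, item_id, name, gold_value):
    """Build a TransactWriteItems request: deduct gold and add the item together."""
    price = {"N": str(gold_value)}
    return {
        "TransactItems": [
            {
                "Update": {
                    "TableName": "Users",
                    "Key": {"UserId": {"S": user_id}},
                    "UpdateExpression": "SET Gold = Gold - :price",
                    # Assumed rule: refuse the purchase if gold is insufficient.
                    "ConditionExpression": "Gold >= :price",
                    "ExpressionAttributeValues": {":price": price},
                }
            },
            {
                "Put": {
                    "TableName": "UserItems",
                    "Item": {
                        "UserId": {"S": user_id},
                        "ItemId": {"S": item_id},
                        "Name": {"S": name},
                        "GoldValue": price,
                    },
                    # Assumed rule: do not overwrite an item the user already owns.
                    "ConditionExpression": "attribute_not_exists(ItemId)",
                }
            },
        ]
    }


request = buy_item_transaction("user-1", "sword-42", "Iron Sword", 50)
# With boto3 this would be sent as:
#   boto3.client("dynamodb").transact_write_items(**request)
```

If either condition fails, neither write is applied, which addresses the partial-failure concern without moving the inventory into the Users row.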
I'm trying to figure out how to model the following data in AWS DynamoDB table.
I have a lot of IOT devices, each sends telemetry data every few seconds.
Attributes
device_id
timestamp
malware_name
company_name
action_performed (two possible values)
Queries
Show all incidents that happened in the last week.
Show all incidents for a specific device_id.
Show all incidents with action "unable_to_remove".
Show all incidents related to a specific malware.
Show all incidents related to specific company.
Thoughts
I understand that I can add GSIs for each attribute, but I would like to use GSIs only if there is no other choice, as they cost me more money.
What would be the main primary key (partition-key:sort-key)?
Please share your thoughts; I care about them more than I care about the perfect answer, as I'm trying to learn how to think and what to consider instead of having an answer for a specific question.
Thanks a lot !
If you absolutely need the query patterns mentioned, you have no way out but to create a GSI for each. That, too, has its set of caveats:
For query #1, your GSI would be incident_date (or whatever) as partition-key and device_id as sort-key. This might lead to hot partitioning in DynamoDB, based on your access patterns.
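There is a limit on the number of GSIs per table. It was 5 when this was written, so you would use it up right away; the default quota has since been raised to 20. What will you do if you need to support another kind of query in the future?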
While evaluating pros and cons of using NoSQL for a given situation, one needs to consider both read and write access patterns. So, the question you should ask is, why DynamoDB?
For example, do you really need real-time queries? If not, you can use DynamoDB as the main database and periodically sync data (using AWS Lambda or Kinesis Firehose) to EMR or Redshift for later batch processing.
Edit: Proposed primary key:
device_id as partition-key and incident_date as sort-key, if you know that no two incidents for a given device_id can arrive at exactly the same time.
If the above doesn't work, then incident_id as partition-key and incident_date as sort-key.
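With the first proposal (device_id as partition key, incident_date as sort key), query #2 can be served from the base table and can also be narrowed to a time window. Here is a minimal sketch of the low-level Query parameters; the table name `Incidents`, the ISO 8601 timestamp format, and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed storage format for incident_date


def incidents_for_device(device_id, now, days=7, table_name="Incidents"):
    """Build Query parameters for one device's incidents in the last `days` days."""
    start = now - timedelta(days=days)
    return {
        "TableName": table_name,
        "KeyConditionExpression": (
            "device_id = :d AND incident_date BETWEEN :start AND :end"
        ),
        "ExpressionAttributeValues": {
            ":d": {"S": device_id},
            ":start": {"S": start.strftime(TS_FORMAT)},
            ":end": {"S": now.strftime(TS_FORMAT)},
        },
    }


params = incidents_for_device(
    "device-17", datetime(2020, 8, 26, 0, 0, 0, tzinfo=timezone.utc)
)
```

This only covers per-device reads. "All incidents in the last week" across every device has no single partition key to target under this design, so it still needs a GSI, a Scan, or an export to a batch system as suggested above.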
I am thinking of building a chat app with AWS DynamoDB. The app will support 1:1 and group chats.
I want to create one table for each one of the chats, where there is a record for each sent chat text line. Is DynamoDB suitable for this kind of job?
I am also thinking of merging both tables. But is this a good idea, if there are – let's assume – 100k or 1000k users?
I think you may run into problems with the read capacity on your table. The write capacity should be ok, as there are not so many messages coming in per second (e.g. 10 or so), but you'll need to constantly read from it for all users, so that'll be expensive.
If you want to use DynamoDB just as storage and distribute the chat messages over the network like in any normal chat, then it may make sense, depending on your use cases. Assuming you have UserId as the hash key and Timestamp as the range key, you could query all messages from a specific user during a specific period. If, however, you want to search within the chat text (probably a much more useful feature), then DynamoDB won't work per se. It's not like SQL, where you could do a LIKE '%abc%' query (which isn't a good idea in SQL either).
You're probably better off using S3 as data storage and Elasticsearch as the search tool. If you require the aforementioned use case "get all messages from user X in timespan S" (as a simple example), you could additionally use DynamoDB to store metadata, such as UserId, Timestamp, PositionInFile, or something like that.