I have a DynamoDB table with a partition key (UUID) with a few attributes (like name, email, created date etc). Created date is one of the attribute in the item and its format is YYYY-MM-DD. But now there is a requirement change - I have to sort it based on created date and bring the entire data (that is, I cannot just bring the data on a particular partition, but the whole data from all the partitions in a sorted fashion. I know this might take time as DynamoDB to fetch data from all the partitions and sort it after. My question is:
Is the querying possible with the current design? I can see that partition key is required in the query, this is why I am confused because I cannot give a partition key here.
Is there a better way to redesign the table for such a use case?
Thanks in advance.
As your table exists you couldn't change the structure now, and even if you wanted to you would be reliant on UUID as your partition key.
There is functionality however to create a global secondary index for your DynamoDB table.
By using a GSI you can rearrange your data representation to include the creation date as the partition key of your table instead.
The reason why partition keys are important is that data in DynamoDB data is distributed across multiple nodes, with each partition sharing the same node. By performing a query it is more efficient to only communicate with one partition, as there is no need to wait on the other partitions returning results.
Related
The dynamo db will not allow data to be inserted into table unless the value contains the primary key set during table creation.
Dynamodb table:
id (primary key)
device_id
temperature_value
I am sending data from IoT core rule engine into the Dynamodb (Split message into multiple columns of a DynamoDB table (DynamoDBv2)). However, data does not arrive at the dynamo db table if the msg is missing the id attribute.
Is there any way to set primary key to be auto incrementing every time a new data point arrives?
DynamoDB does not support auto incrementing functionality for keys as it might have in a relational database.
Instead this will need to be generated by you at the time of inserting the record into DynamoDB.
There are a few options to generate:
Use a primary key combined of partition key (referencing your sensor id) and a sort key (something such as an event time, or a randomly generated string).
Generate a random string instead and insert this.
Use a seperate data store such as relational or Redis, where you autoincrement a value and use this. This is really not ideal.
Use a seperate DynamoDB table to include this value ensuring you use a transactional write to lock the row and increment, and strongly consistent read to get the latest value. Again this is not ideal
I am new to AWS, while reading the DynamoDB docs I came to know that we can have GSI and partition key on same table.
How does DynamoDB keep the data in the same table on the basis of to keys(partitioned and secondary).
Thanks
DynamoDB actually replicates data changes across to the GSI, in fact AWS recommends the write capacity is equal or higher on a GSI than that of the base table.
To avoid potential throttling, the provisioned write capacity for a global secondary index should be equal or greater than the write capacity of the base table because new updates write to both the base table and global secondary index.
The GSI could be stored on completely different infrastructure than that of the base table.
A global secondary index is stored in its own partition space away from the base table and scales separately from the base table.
Your main table is composed at least of a partition key and an optional sortkey. When you add a GSI, it's actually a replication of the main table using the GSI as the new partition key.
When you update your main table, the changes are propagated to the GSI. There is always a short propagation delay between a write to the parent table and the time when the written data appears in the index. In other words, your applications should take into account that the global secondary index replica is only eventually consistent with the parent table.
I have created a dynamo db table with name- "sample".It has below columns. CreatedDate will have creation time of any records inserted to this table.
Itemid,
ItemName,
ItemDescription,
CreatedDate,
UpdatedDate
I am creating a python-flask based rest api which always fetches last 100 records inserted to this table. This API (python-flask function) does not have any input parameters. It should just return the last records inserted to this table.
Question 1
What should be the partition key for this table? I am using the boto3 library to fetch records from DynamoDB. I prefer not to do scan operation because it may cause performance issues. If I use the query function it asks for a partition key. Since this rest API does not accept any input I am not sure how to use it.
Question 2
Has anyone faced similar situation? And what was done to fix this?
Note: I am pretty much newbie to DynamoDB, NoSQL and Boto
To query your table using CreatedDate without knowing the ItemId, you can use Global Secondary Index write sharding by adding an attribute (e.g., ShardId) containing a (0-N) value to every item that you will use for the global secondary index partition key.
Depending on how your items are distributed against CreatedDate, you can set the ShardId so that it is likely to have evenly distributed access patterns. For example: YYYY, YYYYMM or YYYYMMDD. Then, you create a global secondary index with ShardId as an index partition key and CreatedDate as an index sort key.
Knowing the primary key for your GSI (since the ShardId value is derived from CreatedDate), you can query the table for the 100 most recent items with query's Limit parameter (or LastEvaluatedKey if your items set size is larger than 1 MB of data).
See Using Global Secondary Index Write Sharding for Selective Table Queries.
I'd like to use AWS DynamoDB as a datastore for a data-collection application, where the data schema may vary over time.
For example, initially an Item may represent attributes of people e.g. {name, age}. However, later the schema may be modified to contain {name, age, gender}.
Each schema modification will be tracked and versioned and older data won't need to be migrated - but it may still need to be queried alongside newer data.
Is it an acceptable pattern to store each data-schema change in its own table? Is there a straightforward mechanism to query aggregated data across tables?
Schemas for DynamoDB tables are dynamic in nature. The only thing that needs to be set up upfront is the key name and type. You can add global indexes any time too (indexes with a different partition key). Local indexes, however, those with the same partition key but different sort key, they are added at table creation table. Because of this dynamic schema, you can add new fields, or stop adding them any time.
You need to design tables knowing how would you query them. Queries are quite restricted, you can filter but that's not a fast/cheap operation. Fast queries rely on existing indexes. Queries can fetch from a single table. Joins/unions aren't available.
A table scan is done without any criteria, only filters are available. With filters, data is fetched from disk but can be removed from the returned set. It's an expensive operation in both cost and time. Queries passing a key are faster because they fetch data from a single partition. So you might want to design a key with both a partition (userId for instance) and sort key (item id). It is usual to have compound keys on DynamoDB.
Also it is important to avoid hot spots inside a table. That is, data needs to be fairly distributed inside partition keys.
Reference: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BestPractices.html
Let's say I already have partition key on a table and I'm adding a global secondary index. What would be the point of creating this GSI without a sort key? The more I read about GSI, AWS seems to stress the flexibility GSIs have regarding specifying your own partition key and sort key. I'm not quite sure the use of adding a GSI without specifying a sort key.
GSI's with only a Partition Key allow you to query the DynamoDB table using the attribute you chose to be the Partition Key.
For example, if you have a table that has three attributes:
userId
username
updatedAt
If your table's Primary Key consists of let's say userId as the Partition Key and updatedAt as the Sort Key (which will allow you to query the table for the list of users sorted by updatedAt date), then you can add a GSI with only the username as the Partition Key to query that same table for a specific username.
GSI gives you the ability to use an Index key - giving you the ability to access a key very fast O(n) on your table.
Typically, you would want to have a sort key as you get indexing on base table itself. However, when creating the table, the sort key was not created, then you cannot create it retrospectively. In that case, you can create a GSI (of course, you would create GSI's normally as well for indexing other attributes). Also, if your GSI hash key is different from main table hash key, then also sort key won't work and you need a GSI.
GSI's are stored in another table (managed by DDB itself and not shown to user). The table contains all projected attributes when creating the GSI. Whenever, a record gets updated in main table, then the GSI table is also updated with the same (though there is a small lag as its not a transaction leading to eventual consistency). So, if you query the GSI immediately after updating a record, it can happen sometimes that you get old/stale data.