I am new to nosql / DynamoDB.
I have a list of ~10 000 container-items records, which is updated every 6 hours:
[
{ containerId: '1a3z5', items: ['B2a3, Z324, D339, M413'] },
{ containerId: '42as1', items: ['YY23, K132'] },
...
]
(primary key = containerId)
Is it viable to just delete the table, and recreate with new values?
Or should I loop through every item of the new list, and conditionally update/write/delete the current DynamoDB records (using batchwrite)?
For this scenario batch update is better approach. You have 2 cases:
If you need to update only certain records than batch update is more efficient. You can scan the whole table and iterate thought the records and only update certain records.
If you need to update all the records every 6 hours batch update will be more efficient, because if you drop the table and recreate table, that also means you have to recreate indexes and this is not a very fast process. And after you recreate table you still have to do the inserts and in the meantime you have to keep all the records in another database or in-memory.
One scenario where deleting the whole table is a good approach if you need to delete all the data from the table with thousands or more records, than its much faster to recreate table, than delete all the records though API.
And one more suggestion have you considered alternatives, because your problem does not look like a good use-case for DynamoDB. For example MongoDB and Cassandra support update by query out of the box.
If the update touches some but not all existing items and if partial update of 'items' is possible then you have no choice but to do a per record operation. And this would be true even with a more capable database.
You can perhaps speed it up by retrieving only the existing containerIds first so based on that set you know which to do update versus insert on. Alternately you can do a batch retrieve by ids using the ids from the set of updates and which every ones do not return a result are the ones you have to insert and the ones where you do are the ones to update.
Related
I want to truncate dynamodb table which can have up to 3 millions to 4 millions of records. what is the best way?
Right now I am using scan which does not give good performance(I have tried to delete only for few records: 3):
DynamoDB dynamoDB = new DynamoDB(amazonDynamoDBClient);
Table table = dynamoDB.getTable("table-test");
ItemCollection<ScanOutcome> resultItems = table.scan();
Iterator<Item> itemsItr = resultItems.iterator();
while(itemsItr.hasNext()){
Item item = itemsItr.next();
String itemPk = (String) item.get("PK");
String itemSk = (String) item.get("SK");
DeleteItemSpec deleteItemSpec = new DeleteItemSpec().withPrimaryKey("PK", itemPk, "SK", itemSk);
table.deleteItem(deleteItemSpec);
}
The best way is to delete your table, and create new one of the same name. This is how clearing all data from DynamoDB is usually performed.
As Marcin already answered, the best way is to delete your table and create a new one. It is certainly the cheapest way - because any other way would require scanning the entire table and paying for the read capacity units required to do it.
In some cases, however, you might want to delete old items while the table is still actively used. In that case you can use a Scan like you wanted, but can do it much more efficiently than you did: First, don't run individual DeleteItem requests sequentially, waiting for one delete to complete before asking for the next one... You can send batches of 25 deletes in one BatchWriteItem request. You can also send multiple BatchWriteItem requests in parallel. Finally, for even faster deletion, you can parallelize your Scan to multiple threads or even machines - see the parallel scan section of the DynamoDB documentation. Just don't forget that if you delete items while the table is still actively written to, you need a way to tell old items which you want to delete, from new items that you don't want to delete - as the scan may start producing these new items as well.
Finally, if you find yourself often clearing old data from a table - you should consider whether you can use DynamoDB's TTL feature, where DynamoDB automatically looks for expired items (based on an expiration-time attribute on each item) and deletes them - at no cost to you.
I have a dynamo db table with following structure
partitionKey - userId+keyName
sortKey - keyName+itemId
itemData - any object
createdAt - long value
updatedAt - long value
In this table I want to save list of items lets say all unique eatable items found in a shop. As per the requirement I need to find out the count of items in a particular shop. As per my findings I came across three ways to do this
Use Query to fetch count as per this link without explicitly saving count value.
Use transactions while saving items and store/update count explicitly. [We want to add/remove multiple items in a single request]. And later get count using GetItem api.
Use dynamo db streams to trigger SNS and eventually store explicit count in the same table/different table. And later get count using GetItem api.
Note
Latency is important here along with the cost.
You can assume this dynamo db table can have millions of items.
Eventual consistency is fine.
In my view 3rd option looks more efficient in terms of cost, latency. But want to know if my thoughts are correct
Using Dynamo streams to write aggregate data back to Dynamo is definitely the way to go!
This will of course be eventually consistent by its nature, as updating your item and waiting for the stream to update the aggregate are two different non-atomic operations.
A fourth option would be to have something like an ElasticSearch index updated (also by using streams), which allows you to do arbitrary ad-hoc queries.
If you need consistency for your aggregates, you have to use transactions for this, with all the limitations imposed.
I feel like I'm thinking my self in circles here. Maybe you all can help :)
Say I have this simple table design in DynamoDB:
Id | Employee | Created | SomeExtraMetadataColumns... | LastUpdated
Say my only use case is to find all the rows in this table where LastUpdated < (now - 2 hours).
Assume that 99% of the data in the table will not meet this criteria. Assume there is a some job running every 15 mins that is updating the LastUpdated column.
Assume there are say 100,000 rows and grows maybe 1000 rows a day. (no need to large write capacity).
Assume a single entity will be performing this 'read' use case (no need for large read capacity).
Options I can think of:
Do a scan.
Pro: can leverage parallels scans to scale in the future.
Con: wastes a lot of money reading rows that do not match the filter criteria.
Add a new column called 'Constant' that would always have the value of 'Foo' and make a GSI with the Partition Key of 'Constant' and a Sort Key of LastUpdated. Then execute a query on this index for Constant = 'Foo' and LastUpdated < (now - 2hours).
Pro: Only queries the rows matching the filter. No wasted money.
Con: In theory this would be plagued by the 'hot partition' problem if writes scale up. But I am unsure how much of a problem it will be as aws outlined this problem to be a thing of the past.
Honestly, I leaning toward the latter option. But I'm curious what the communities thoughts are on this. Perhaps I am missing something.
Based on the assumption that the last_updated field is the only field you need to query against, I would do something like this:
PK: EMPLOYEE::{emp_id}
SK: LastUpdated
Attributes: Employee, ..., Created
PK: EMPLOYEE::UPDATE
SK: LastUpdated::{emp_id}
Attributes: Employee, ..., Created
By denormalising your data here you have the ability to create an update record with an update row which can be queried with PK = EMPLOYEE::UPDATE and SK between 'datetime' and 'datetime'. This is assuming you store the datetime as something like 2020-10-01T00:00:00Z.
You can either insert this additional row here or you could consider utilising DynamoDB streams to stream update events to Lambda and then add the row from there. You can set a TTL on the 'update' row which will expire somewhere between 0 and 48 hours from the TTL you set keeping the table clean. It doesn't need to be instantly removed because you're querying based on the PK and SK anyway.
A scan is an absolute no-no on a table that size so I would definitely recommend against that. If it increases by 1,000 per day like you say then before long your scan would be unmanageable and would not scale. Even at 100,000 rows a scan is very bad.
You could also utilise DynamoDB Streams to stream your data out to data stores which are suitable for analytics which is what I assume you're trying to achieve here. For example you could stream the data to redshift, RDS etc etc. Those require a few extra steps and could benefit from kinesis depending on the scale of updates but it's something else to consider.
Ultimately there are quite a lot of options here. I'd start by investigating the denormalisation and then investigate other options. If you're trying to do analytics in DynamoDB I would advise against it.
PS: I nearly always call my PK and SK attributes PK and SK and have them as strings so I can easily add different types of data or denormalisations to a table easily.
Definitely stay away from scan...
I'd look at a GSI with
PK: YYYY-MM-DD-HH
SK: MM-SS.mmmmmm
Now to get the records updated in the last two hours, you need only make three queries.
We have a huge DynamoDB table (~ 4 billion items) and one of the columns is some kind of category (string) and we would like to map this column to either new one category_id (integer) or update existing one from string to int. Is there a way to do this efficiently without creating new table and populating it from beginning. In other words to update existing table?
Is there a way to do this efficiently
Not in DynamoDB, that use case is not what it's designed for...
Also note, unless you're talking about the hash or sort key (of the table or of an existing index), DDB doesn't have columns.
You'd run Scan() (in a loop since it only returns 1MB of data)...
Then Update each item 1 at a time. (note could BatchUpdate of 10 items at a time, but that save just network overhead..still does 10 individual updates)
If the attribute in question is used as a key in the table or an existing index...then a new table is your only option. Here's a good article with a strategy for migrating a production table.
Create a new table (let us call this NewTable), with the desired key structure, LSIs, GSIs.
Enable DynamoDB Streams on the original table
Associate a Lambda to the Stream, which pushes the record into NewTable. (This Lambda should trim off the migration flag in Step 5)
[Optional] Create a GSI on the original table to speed up scanning items. Ensure this GSI only has attributes: Primary Key, and Migrated (See Step 5).
Scan the GSI created in the previous step (or entire table) and use the following Filter:
FilterExpression = "attribute_not_exists(Migrated)"
Update each item in the table with a migrate flag (ie: “Migrated”: { “S”: “0” }, which sends it to the DynamoDB Streams (using UpdateItem API, to ensure no data loss occurs).
NOTE You may want to increase write capacity units on the table during the updates.
The Lambda will pick up all items, trim off the Migrated flag and push it into NewTable.
Once all items have been migrated, repoint the code to the new table
Remove original table, and Lambda function once happy all is good.
I have next table structure:
ID string `dynamodbav:"id,omitempty"`
Type string `dynamodbav:"type,omitempty"`
Value string `dynamodbav:"value,omitempty"`
Token string `dynamodbav:"token,omitempty"`
Status int `dynamodbav:"status,omitempty"`
ActionID string `dynamodbav:"action_id,omitempty"`
CreatedAt time.Time `dynamodbav:"created_at,omitempty"`
UpdatedAt time.Time `dynamodbav:"updated_at,omitempty"`
ValidationToken string `dynamodbav:"validation_token,omitempty"`
and I have 2 Global Secondary Indexes for Value(ValueIndex) filed and Token(TokenIndex) field. Later somewhere in the internal logic I perform the Update of this entity and immediate read of this entity by one of this indexes(ValueIndex or TokenIndex) and I see the expected problem that data is not ready(I mean not yet updated). I can't use ConsistentRead for this cases, because this is Global Secondary Index and it doesn't support this options. As a result I can't run my load tests over this logic, because data is not ready when tests go in 10-20-30 threads. So my question - is it possible to solve this problem somewhere? or should I reorganize my table and split it to 2-3 different tables and move filed like Value, Token to HASH key or SORT key?
GSIs are updated asynchronously from the table they are indexing. The updates to a GSI typically occur in well under a second. So, if you're after immediate read of a GSI after insert / update / delete, then there is the potential to get stale data. This is how GSIs work - nothing you can do about that. However, you need to be really mindful of three things:
Make sure you keep your GSI lean - that is, only project the absolute minimum attributes that you need. Less data to write will make it quicker.
Ensure that your GSIs have the correct provisioned throughput. If it doesn't, it may not be able to keep up with activity in the table and therefore you'll get long delays in the GSI being kept in sync.
If an update causes the keys in the GSI to be updated, you'll need 2 units of throughput provisioned per update. In essence, DynamoDB will delete the item then insert a new item with the keys updated. So, even though your table has 100 provisioned writes, if every single write causes an update to your GSI key, you'll need to provision 200 write units.
Once you've tuned your DynamoDB setup and you still absolutely cannot handle the brief delay in GSIs, you'll probably need to use different technology. For example, even if you decided to split your table into multiple tables, it'll have the same (if not worse) impact. You'll update one table, then try to read the data from another table and you haven't yet inserted the values into a different table.
I suspect that once you tune DynamoDB for your situation, you'll get pretty damn close you what you want.