I have recently enabled the TTL feature on an AWS DynamoDB table. Newly added records now expire after a certain time; when they do, DynamoDB Streams invokes a Lambda that pushes the removed data to a separate archive DynamoDB table.
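For reference, the archiving side looks roughly like the minimal sketch below; the table name my-archive-table is hypothetical, and the userIdentity check is the documented way to distinguish TTL expirations from ordinary deletes.

```python
import boto3
from boto3.dynamodb.types import TypeDeserializer

archive = boto3.resource("dynamodb").Table("my-archive-table")  # hypothetical name
deserializer = TypeDeserializer()

def handler(event, context):
    """Copy TTL-expired items from the stream into the archive table.
    Requires the stream to include OLD_IMAGE (or NEW_AND_OLD_IMAGES)."""
    for record in event["Records"]:
        # TTL deletions arrive as REMOVE events performed by the DynamoDB
        # service itself, which is what the userIdentity check detects.
        if (record["eventName"] == "REMOVE"
                and record.get("userIdentity", {}).get("principalId")
                == "dynamodb.amazonaws.com"):
            old_image = record["dynamodb"]["OldImage"]
            # OldImage is DynamoDB JSON; deserialize it into plain Python
            # values before re-writing it to the archive table.
            item = {k: deserializer.deserialize(v) for k, v in old_image.items()}
            archive.put_item(Item=item)
```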
Now, since the old data do not have the TTL attribute, how can I add it to the older items (except some)? I know we can use Scan, but the bottleneck is its performance: I have close to 400K items and no index, and I do not want to scan the whole table.
Is there an efficient way to achieve this?
Any help or suggestion would be appreciated. Thanks.
I have a streaming app that is putting data actively into DynamoDB.
I want to store the last 100 added items and delete the older ones; it seems that the TTL feature will not work in this case.
Any suggestions?
There is no feature within Amazon DynamoDB that enforces only keeping the last n items.
Enforce the 100-item limit within your application, perhaps by storing and keeping a running counter.
I'd do this via a Lambda function with a trigger on the DynamoDB table in question.
The Lambda would then delete the older entries each time a change is made to the table. You'd need some sort of high-water mark (HWM) for the table items and some way to keep track of it; I'd keep it in a secondary DynamoDB table. Each new item put to the item table would read that HWM, stamp it as a field on the item, and increment it, essentially implementing an auto-increment field, since those don't exist in DynamoDB. The Lambda function could then delete any items with an auto-increment ID of HWM - 100 or less (see the sketch below).
There may be better ways, but this would achieve the goal.
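A minimal sketch of that stream handler, assuming the writer has already stamped each item with its seq value from the HWM counter, the partition key is id, and a GSI named seq-index exists on seq (all hypothetical names):

```python
import boto3
from boto3.dynamodb.conditions import Key

items = boto3.resource("dynamodb").Table("streaming-items")  # hypothetical name

KEEP = 100  # number of most recent items to retain

def handler(event, context):
    """Triggered by the item table's stream on every write.
    Requires the stream to include NEW_IMAGE."""
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        # The writer stamped the new item with its auto-increment value,
        # so the stream's NewImage carries it.
        seq = int(record["dynamodb"]["NewImage"]["seq"]["N"])
        stale = seq - KEEP
        if stale > 0:
            # Look up the item that just fell out of the last-100 window
            # via the hypothetical "seq-index" GSI (no table scan needed).
            resp = items.query(
                IndexName="seq-index",
                KeyConditionExpression=Key("seq").eq(stale),
            )
            for item in resp.get("Items", []):
                items.delete_item(Key={"id": item["id"]})
```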
In my use case, I need to periodically update a DynamoDB table (say, once per day). Since lots of entries need to be inserted, deleted, or modified each time, I plan to drop the old table and create a new one.
How can I keep the table queryable while I recreate it? Which API should I use? It's fine if queries continue to hit the old table during the rebuild, so that customers won't experience any outage.
Is it possible to have something like a version number for the table so that I could roll back quickly?
I would suggest a table name with a common suffix (some people use a date, others use a version number).
Store the usable DynamoDB table name in a configuration store (if you are not already using one, you could use Secrets Manager, SSM Parameter Store, another DynamoDB table, a Redis cluster or a third party solution such as Consul).
Automate the creation and insertion of data into a new DynamoDB table. Then update the config store with the name of the newly created DynamoDB table. Allow enough time to switchover, then remove the previous DynamoDB table.
You could automate the final part by using Step Functions, with a Wait state of a few hours to ensure that nothing is still using the old table; you could even add a Lambda function that validates whether any traffic is still hitting the old DynamoDB table.
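As a sketch of the config-store idea using SSM Parameter Store (the parameter name and table names are hypothetical):

```python
import boto3

ssm = boto3.client("ssm")
PARAM = "/myapp/dynamodb/active-table"  # hypothetical parameter name

def active_table_name():
    """Resolve the currently live table name at query time."""
    return ssm.get_parameter(Name=PARAM)["Parameter"]["Value"]

def promote_table(new_table_name):
    """Point readers at the freshly built table; rollback is the same call
    with the previous table's name."""
    ssm.put_parameter(Name=PARAM, Value=new_table_name,
                      Type="String", Overwrite=True)

# Daily rebuild: create and load e.g. "orders-v42" (hypothetical), then:
#   promote_table("orders-v42")
# Quick rollback:
#   promote_table("orders-v41")
```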
I am using DynamoDB as the back-end database in my project. I am storing items of 80 KB or more each (they contain nested JSON), and my partition key is a column whose value is unique for each item. Now I want to paginate this table: my UI provides a start integer, a limit integer, and two string constants for the type, and my API should retrieve items from DynamoDB based on those query parameters. I am using the Scan method from boto3 (the Python SDK), but Scan reads all the items from my table before applying my filters, causing a provisioned-throughput error, and I cannot afford either to increase my table's throughput or to enable auto-scaling. Is there any way to solve this problem? Please give your suggestions.
Do you have a limit set on your scan call? If not, DynamoDB will return you 1MB of data by default. You could try using limit and some kind of sleep or delay in your code, so that you process your table at a slower rate and stay within your provisioned read capacity. If you do this, you'll have to use the LastEvaluatedKey returned to you to page through your table.
Keep in mind that just reading a single one of your 80 KB items consumes 10 read capacity units (assuming eventually consistent reads), and more if your items are larger.
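A minimal sketch of that pattern (the table name is hypothetical; tune the page size and delay to your provisioned capacity):

```python
import time
import boto3

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical name

def slow_scan(page_size=25, pause_seconds=1.0):
    """Scan the table a page at a time, sleeping between pages to stay
    under the provisioned read capacity."""
    kwargs = {"Limit": page_size}
    while True:
        page = table.scan(**kwargs)
        for item in page["Items"]:
            yield item
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break  # reached the end of the table
        kwargs["ExclusiveStartKey"] = last_key  # resume from where we stopped
        time.sleep(pause_seconds)  # throttle ourselves between pages
```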
My goal is to take daily snapshots of an RDS table and put it in a DynamoDB table. The table should only contain data from a single day.
For this I have a Data Pipeline set up to query an RDS table and publish the results to S3 in CSV format.
Then a HiveActivity imports this CSV into a DynamoDB table by creating external tables for the file and an existing DynamoDB table.
This works great, but older entries from the previous day still exist in the DynamoDB table. I want to do this within Data Pipeline if at all possible. I need to either:
1) Find a way to clear the DynamoDB table, or at least drop and recreate it, or
2) Include an extra column with the snapshot date and find a way to clear out all older entries.
Any ideas on how I can do this?
You can use DynamoDB Time to Live (TTL), which lets you set an expiration time after which items are automatically deleted from the table. TTL is very useful for cases where data loses its relevance after a specific time period; in your case that can be the start of the next day.
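For example, the extra snapshot-date column from option 2 could carry the TTL value directly. A minimal sketch, assuming TTL is enabled on the table with a (hypothetical) attribute named expires_at:

```python
from datetime import datetime, timedelta, timezone
import boto3

table = boto3.resource("dynamodb").Table("daily-snapshot")  # hypothetical name

def put_with_ttl(item):
    """Stamp each imported row so it expires at the start of the next day."""
    midnight_next = (datetime.now(timezone.utc) + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    # TTL attributes must hold the expiry time as epoch seconds.
    item["expires_at"] = int(midnight_next.timestamp())
    table.put_item(Item=item)
```

Note that TTL deletion is not instantaneous (it can lag by hours), so yesterday's rows may linger briefly after midnight.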
Is there a way to check when a dynamoDB item was last updated without adding a separate attribute for this? (For example, are there any built-in attributes or metadata that could provide this information?)
No. This is not a built-in feature of the DynamoDB API.
You have to implement it yourself by adding an attribute, such as UpdatedTime, to each item and setting it to the current time on every write.
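A minimal sketch of that, with hypothetical table, key, and attribute names:

```python
from datetime import datetime, timezone
import boto3

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical name

def update_with_timestamp(key, new_status):
    """Write the change and stamp UpdatedTime in the same request."""
    table.update_item(
        Key=key,
        UpdateExpression="SET #s = :s, UpdatedTime = :t",
        ExpressionAttributeNames={"#s": "status"},  # 'status' is a reserved word
        ExpressionAttributeValues={
            ":s": new_status,
            ":t": datetime.now(timezone.utc).isoformat(),
        },
    )
```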
"For example, are there any built-in attributes or metadata that could provide this information?" No.
There are multiple approaches to implement this with DynamoDB.
Use a sort key, GSI, or LSI with a timestamp attribute, so you can query for the most recently updated item.
When adding an item to the table, keep track of the last updated time in your backend.
Using DynamoDB Streams, create a Lambda function that executes when an item is added and tracks the last updated time (see the sketch below).
Note: if you go with either of the last two approaches, you can still use a separate DynamoDB table to store metadata such as the last-updated attribute.
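A minimal sketch of that Streams approach, writing to a separate (hypothetical) item-metadata table keyed on the same string id as the source table:

```python
from decimal import Decimal
import boto3

meta = boto3.resource("dynamodb").Table("item-metadata")  # hypothetical name

def handler(event, context):
    """Record each item's last update time in a separate metadata table,
    which avoids write loops back into the source table."""
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            keys = record["dynamodb"]["Keys"]  # DynamoDB JSON, e.g. {"id": {"S": "..."}}
            meta.put_item(Item={
                "id": keys["id"]["S"],  # assumes a string partition key named "id"
                # Epoch seconds when the change was written, per the stream record.
                "last_updated": Decimal(
                    str(record["dynamodb"]["ApproximateCreationDateTime"])),
            })
```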
I don't think there is an out-of-the-box solution for that, but you can use DynamoDB Streams with a basic Lambda function to keep track of which items are updated. You can then store this information somewhere else, such as S3 (through Kinesis Data Firehose), or write it back to the same table.
It may be possible when using Global Tables, with the automatically created aws:rep:updatetime attribute.
See https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/globaltables_HowItWorks.html
It's not clear whether this functionality remains in the latest version, though; I'll update this answer if I find out concretely.