Is there any kind of data retention concept in DynamoDB?
I mean, is there a way to have data in a table deleted automatically after some time, like the retention period we can set in S3?
Thanks,
DynamoDB has introduced a Time to Live (TTL) feature. You can create a numeric field and set its value to the time (in seconds since the epoch) at which you want the record to be deleted.
DynamoDB will automatically delete the record at (or shortly after) the specified time. Note that TTL has to be configured on a per-table basis.
You can find more details at: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/time-to-live-ttl-how-to.html
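For reference, here is a minimal boto3 sketch of enabling TTL and writing an expiring item. The table name, key schema, and "expireAt" attribute are placeholders, not anything from the question:

```python
import time

import boto3  # assumes AWS credentials are configured

dynamodb = boto3.client("dynamodb")

# Enable TTL on an existing table, using an attribute named "expireAt"
# (table and attribute names here are examples only).
dynamodb.update_time_to_live(
    TableName="my-table",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expireAt"},
)

# Write an item that DynamoDB should delete roughly 200 days from now.
# The TTL attribute must be a Number holding an epoch timestamp in seconds.
expire_at = int(time.time()) + 200 * 24 * 60 * 60
dynamodb.put_item(
    TableName="my-table",
    Item={
        "pk": {"S": "user#123"},
        "expireAt": {"N": str(expire_at)},
    },
)
```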
No, there is no "retention" setting available in DynamoDB.
You could run a daily/monthly query that uses a date field to filter results, and use that output to determine which items to delete. This would need to be implemented in your own application code (a sketch follows the examples below).
Some users choose to use separate tables to provide ageing. For example, create a separate table for each month. Then, delete old tables once they pass a certain age. However, your software would have to know how to handle multiple tables of data.
Examples:
Reference to monthly rotation of tables
Understand Access Patterns for Time Series Data
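For the query-and-delete approach above, here is a minimal boto3 sketch. It assumes a hypothetical "createdAt" ISO-8601 string attribute and a simple "pk" primary key, neither of which comes from the original post:

```python
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Attr

# Hypothetical table with a "createdAt" ISO-8601 string attribute (UTC).
table = boto3.resource("dynamodb").Table("my-table")
cutoff = (datetime.now(timezone.utc) - timedelta(days=200)).isoformat()

# Scan for expired items; a Scan reads the whole table, so run it off-peak.
response = table.scan(FilterExpression=Attr("createdAt").lt(cutoff))
items = response["Items"]
while "LastEvaluatedKey" in response:
    response = table.scan(
        FilterExpression=Attr("createdAt").lt(cutoff),
        ExclusiveStartKey=response["LastEvaluatedKey"],
    )
    items.extend(response["Items"])

# Delete each expired item by its primary key ("pk" is a placeholder).
with table.batch_writer() as batch:
    for item in items:
        batch.delete_item(Key={"pk": item["pk"]})
```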
In an application that collects data from IoT devices from multiple customers in a single AWS Timestream table, what happens if a customer leaves and requests that all of their data be deleted, in accordance with the GDPR's right to erasure?
Timestream doesn't support deleting records, only updating them.
Possible solutions we have come up with so far:
update all of the customer's records with zero values
use a table per customer and then delete the whole table (problematic with a large number of customers)
don't actually delete the data, but make it no longer relatable to the customer (probably not GDPR-compliant anyway)
Are there any other options here? Or is there no solution and we have to dump Timestream altogether and use a different product that supports deletes?
One other alternative is to use data retention to delete the data in Timestream after a period of time you define.
GDPR mentions that you need to delete the data within one month (it seems you can take up to two more months, but then you need to inform the user that the deletion will take longer).
Keeping data in Timestream can become expensive. I usually store the data in Timestream (short retention) and in S3. I use Timestream for fast analytics or near-real-time dashboarding, and S3 for ad hoc queries, ML, or historical dashboarding.
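As a rough sketch of that retention approach (the database, table, and retention values below are placeholders, with the magnetic store window set near the GDPR timeline discussed above):

```python
import boto3

# Sketch: configure how long Timestream keeps records before deleting them.
# Records older than the magnetic store window are removed automatically.
timestream = boto3.client("timestream-write")
timestream.update_table(
    DatabaseName="iot-db",        # placeholder database name
    TableName="device-metrics",   # placeholder table name
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 24,
        "MagneticStoreRetentionPeriodInDays": 30,
    },
)
```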
What I have seen so far is that the AWS Glue crawler creates the table based on the latest changes in the S3 files.
Let's say the crawler creates a table, and then I upload a CSV with updated values in one column. When the crawler runs again, it updates the table's column with the new values. I want to eventually be able to show a comparison of the old and new data in QuickSight. Is this scenario possible?
For example,
right now my CSV file is set up as the details of one AWS service; RDS is the CSV file name, and the columns are account ID, account name, which region it is in, etc.
Say one column held a percentage value of 50% and it gets updated to 70%. Would I be able to somehow get the old value as well, to show in QuickSight something like "previously it was 50% and now it's 70%"?
Maybe this scenario is not even valid? I want to be able to show which account has what cost in a given month, and how the cost differs in other months. If I made a separate table on each update of the CSV, there would eventually be 1000+ tables.
If I have understood your question correctly, you are aiming to track data over time. Above you suggest creating a table for each update; why not instead maintain a record in a single table for each snapshot? You can then build various analyses over the data, comparing specific months or tracking month-by-month values.
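One possible way to implement that, as a sketch only (the file, bucket, and column names are invented): stamp each CSV with a snapshot date and write it to a new key under the same prefix, so the crawler builds a single table that keeps history:

```python
from datetime import date

import boto3
import pandas as pd

# Add a snapshot_date column so each run appends new rows per account
# instead of overwriting the previous values.
df = pd.read_csv("rds_report.csv")  # placeholder input file
df["snapshot_date"] = date.today().isoformat()

# Write to a dated key under one prefix; the crawler then sees one table
# with a snapshot_date column that QuickSight can filter and compare on.
boto3.client("s3").put_object(
    Bucket="my-reports-bucket",  # placeholder bucket
    Key=f"rds/snapshot_date={df['snapshot_date'][0]}/rds_report.csv",
    Body=df.to_csv(index=False).encode("utf-8"),
)
```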
This is my use case:
I have a JSON API with 200k objects. The dataset looks a little something like this: date, bike model, production time in minutes. I use Lambda to read from the JSON API and write to DynamoDB via HTTP requests. The Lambda function runs every day and updates DynamoDB with the most recent data.
I then retrieve the data by date since I want to calculate the average production time for each day and put it in a second table. An Alexa skill is connected to the second table and reads out the average value for each day.
First question: Since the same bike model is produced multiple times per day, using a composite primary key with date and bike model won't give me a unique key. Shall I create a UUID for the entries instead? Or is there a better solution?
Second question: For the calculation I would need to do a full table scan each time, which is very costly and advised against by many. How can I solve this problem without doing a full table scan?
Third question: Is it better to avoid DynamoDB altogether for my use case? Which AWS database is more suitable for my use case then?
Yes, a UUID or any other unique identifier (e.g. date + bike model + creation time) as the partition key is fine.
It seems your daily average-value job is a data-analytics job, not really a transactional job. I would suggest going with a service that supports data analytics, such as Amazon Redshift. You should be able to feed data into such a service using DynamoDB Streams. Alternatively, you can stream the data into S3 and use a service like Athena to compute the daily average.
There is a simple database model that you could use for this task:
PartitionKey: a UUID, or any combination of fields that provides uniqueness.
SortKey: production date, as a string, e.g. 2020-07-28
If you then create a secondary index that uses the production date as its partition key and includes the production time, you can query (not scan) the index for a specific date and perform whatever calculations you need on the production time. With a global secondary index, you can also provision its read/write capacity independently of the table.
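A query against such an index might look like the following sketch; the table, index, and attribute names are assumptions for illustration:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Query a secondary index keyed on production date (all names are placeholders).
table = boto3.resource("dynamodb").Table("bike-production")
response = table.query(
    IndexName="production_date-index",
    KeyConditionExpression=Key("production_date").eq("2020-07-28"),
    ProjectionExpression="production_time_min",
)

# Average production time for that day, computed client-side.
times = [item["production_time_min"] for item in response["Items"]]
average = sum(times) / len(times) if times else None
print(average)
```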
Regarding your third question, I don't see any real benefit in using DynamoDB for this task. Any RDS engine (e.g. MySQL), Redshift, or even S3 + Athena can easily handle such a use case. If you require real-time analytics, you could even consider Amazon Kinesis.
In my use case, I need to periodically update a DynamoDB table (say once per day). Considering that lots of entries need to be inserted, deleted, or modified, I plan to drop the old table and create a new one each time.
How can I keep the table queryable while I recreate it? Which API should I use? It's fine for queries to keep hitting the old table during the rebuild, so that customers won't experience any outage.
Is it possible to have something like a version number for the table, so that I could roll back quickly?
I would suggest using table names with a versioned suffix (some people use a date, others a version number).
Store the name of the currently active DynamoDB table in a configuration store (if you are not already using one, you could use Secrets Manager, SSM Parameter Store, another DynamoDB table, a Redis cluster, or a third-party solution such as Consul).
Automate the creation of the new DynamoDB table and the insertion of data into it. Then update the config store with the name of the newly created table. Allow enough time to switch over, then remove the previous table.
You could do the final part with Step Functions, automating the workflow with a Wait state of a few hours to ensure that nothing is still using the old table; in fact, you could even add a Lambda function that validates whether any traffic is hitting the old DynamoDB table.
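Here is a minimal sketch of the config-store pattern using SSM Parameter Store; the parameter and table names are hypothetical:

```python
import boto3

ssm = boto3.client("ssm")
PARAM = "/myapp/dynamodb/active-table"  # hypothetical parameter name

def active_table_name() -> str:
    """Readers resolve the current table name from the config store."""
    return ssm.get_parameter(Name=PARAM)["Parameter"]["Value"]

def switch_to(new_table: str) -> None:
    """After the new table is fully loaded, point readers at it."""
    ssm.put_parameter(Name=PARAM, Value=new_table, Type="String", Overwrite=True)

# Example: cut over to a versioned table, keeping the old one for rollback.
# switch_to("orders-v42")
```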
I have started exploring DynamoDB and S3 recently.
I went through the Developer Guide and was curious: is there any way to delete data from a DynamoDB table or an S3 bucket based on some condition?
The condition here could be time or space.
For example, I want to delete data from DynamoDB/S3 after some time (say 200 days), or once it exceeds some limit, like 2 GB.
For S3, it is possible to delete objects after a given time using lifecycle rules. For DynamoDB, there is no such rule.
For both, there is no direct way to clean up based on the space used.
In practice, it is possible to do this "quite easily" using AWS Lambda with S3 and DynamoDB events, but you have to implement it yourself: each time a new entry is added (or every n additions), a Lambda triggered by the S3/DynamoDB event computes the size and cleans up whatever should be cleaned.
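For the time-based S3 case mentioned above, a minimal lifecycle-rule sketch might look like this (the bucket name and prefix are placeholders):

```python
import boto3

# Sketch: expire objects under a prefix 200 days after creation.
boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-after-200-days",
                "Filter": {"Prefix": "data/"},  # placeholder prefix
                "Status": "Enabled",
                "Expiration": {"Days": 200},
            }
        ]
    },
)
```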
What you are looking for is the DynamoDB Time to Live (TTL) feature. It is still in its beta phase.
Your DDB table must have a TTL attribute storing the timestamp in epoch seconds.
"TTL compares the current time in epoch time format to the time stored
in the designated TTL attribute of an item. If the epoch time value
stored in the attribute is less than the current time, the item is
marked as expired and subsequently deleted."
More details can be found in the DDB docs.