Automatically Deleting data from DynamoDB Table and S3 - amazon-web-services

I have started exploring DynamoDB and S3 recently.
I went through the Developer Guide and was curious: is there any way to delete data from a DynamoDB table or an S3 bucket based on some condition?
The condition here may be time or space.
For example, I want to delete data from DynamoDB/S3 after some time (say 200 days), or if it exceeds some limit like 2 GB.

For S3, it is possible to delete objects after a given time using lifecycle rules. For DynamoDB, there is no such rule.
For both, there is no direct way to clean up based on the space taken.
In practice, it is possible to do it "quite easily" using AWS Lambda with S3 and DynamoDB events, but you have to implement it yourself: each time a new entry is added (or every n additions), as notified by the S3/DynamoDB events, compute the size and clean up whatever should be cleaned.
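A minimal sketch of that size-based cleanup for S3 (the bucket name and 2 GB cap are assumptions): a Lambda triggered by S3 ObjectCreated events lists the bucket, sums the object sizes, and deletes the oldest objects until the total is back under the limit.

    import boto3

    s3 = boto3.client("s3")

    BUCKET = "my-data-bucket"      # hypothetical bucket name
    LIMIT_BYTES = 2 * 1024 ** 3    # 2 GB cap

    def lambda_handler(event, context):
        """Triggered by S3 ObjectCreated events; trims the oldest objects once over the cap."""
        objects = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET):
            objects.extend(page.get("Contents", []))

        total = sum(obj["Size"] for obj in objects)
        if total <= LIMIT_BYTES:
            return

        # Delete the oldest objects first until the bucket is back under the limit.
        for obj in sorted(objects, key=lambda o: o["LastModified"]):
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
            total -= obj["Size"]
            if total <= LIMIT_BYTES:
                break

The same pattern works for a DynamoDB table with a stream-triggered Lambda, using the approximate table size reported by DescribeTable instead of an object listing.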

What you are looking for is the DynamoDB Time to Live (TTL) feature. It is still in its Beta phase.
Your DDB table must have a TTL attribute storing the expiry timestamp (in epoch seconds).
"TTL compares the current time in epoch time format to the time stored in the designated TTL attribute of an item. If the epoch time value stored in the attribute is less than the current time, the item is marked as expired and subsequently deleted."
More details can be found in the DDB docs.
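As a rough illustration of the setup (table and attribute names are made up), enabling TTL and writing an item that expires in roughly 200 days could look like this with boto3:

    import time
    import boto3

    dynamodb = boto3.client("dynamodb")

    # Point TTL at the attribute that holds the expiry timestamp (epoch seconds).
    dynamodb.update_time_to_live(
        TableName="my-table",   # hypothetical table name
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
    )

    # Write an item that DynamoDB will expire roughly 200 days from now.
    expires_at = int(time.time()) + 200 * 24 * 60 * 60
    dynamodb.put_item(
        TableName="my-table",
        Item={
            "id": {"S": "item-123"},
            "payload": {"S": "..."},
            "expires_at": {"N": str(expires_at)},
        },
    )

Note that expired items are removed by a background process, so deletion typically happens some time after expiry rather than at the exact second.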

Related

How to implement GDPR Right to erasure with AWS Timestream

In an application that collects data from IoT devices belonging to multiple customers in a single AWS Timestream table, what happens if a customer leaves and requests that all of its data be deleted, in line with GDPR's right to erasure?
Timestream doesn't support deleting records, only updating them.
Possible solutions we have come up with so far:
update all records of the customer with zero values
use table per customer and then delete the whole table (problematic with large number of customers)
don't actually delete the data, but make it not relatable to the customer anymore (probably not GDPR compliant anyway)
Are there any other options here? Or is there no solution and we have to dump Timestream altogether and use a different product that supports deletes?
One other alternative is to use data retention to delete the data in Timestream after a period of time you define.
GDPR mentions that you need to delete the data within 1 month (it seems you can take 2 more months, but then you need to inform the user that data deletion will take longer).
Keeping data in Timestream can become expensive. I usually store the data in Timestream (short retention) and in S3. I use Timestream for fast analytics or near-real-time dashboarding, and S3 for ad hoc queries, ML, or historical dashboarding.
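For context, retention in Timestream is configured per table; a sketch with boto3 (database and table names are invented, and the retention periods are only examples) might look like:

    import boto3

    # Timestream drops data automatically once it ages out of the magnetic store,
    # so the magnetic store retention period effectively acts as the deletion deadline.
    tswrite = boto3.client("timestream-write")

    tswrite.update_table(
        DatabaseName="iot-metrics",                    # hypothetical database name
        TableName="device-readings",                   # hypothetical table name
        RetentionProperties={
            "MemoryStoreRetentionPeriodInHours": 24,   # hot tier for dashboards
            "MagneticStoreRetentionPeriodInDays": 30,  # data removed after ~1 month
        },
    )

Note that this applies uniformly to the whole table, so it bounds how long any customer's data lives rather than deleting one customer's data on demand.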

Copying DynamoDB table to S3 every time table is updated

I have a DynamoDB table that's fairly large and is updated infrequently and sporadically (maybe once a month). When it is updated, it happens in a large batch, i.e. a large number of items are inserted with a batch update.
Every time the DynamoDB table changes, I want to convert the contents to a custom JSON format and store the result in S3 to be more easily accessed by clients (since the most common operation is downloading the entire table, and DynamoDB scans are not as efficient).
Is there a way to convert the table to JSON and copy to S3 after every batch update?
I saw Data Pipelines (https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-exportddbtos3.html) but it seems like this is triggered manually or on a schedule. I also looked at DynamoDB Streams, but it seems like an event is inserted for every record changed. I want to do the conversion and copying to S3 only once per batch update.
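There is no built-in single-shot trigger for this, but the export step itself is simple. A hedged sketch (table, bucket, and key names are placeholders) that scans the table and writes the result to S3 as JSON, intended to be invoked once after each batch update finishes:

    import json
    import boto3
    from boto3.dynamodb.types import TypeDeserializer

    dynamodb = boto3.client("dynamodb")
    s3 = boto3.client("s3")
    deserializer = TypeDeserializer()

    def export_table_to_s3(table_name, bucket, key):
        """Scan the whole table and upload it to S3 as one JSON document."""
        items = []
        paginator = dynamodb.get_paginator("scan")
        for page in paginator.paginate(TableName=table_name):
            for item in page["Items"]:
                # Convert DynamoDB's typed attribute format into plain Python values.
                items.append({k: deserializer.deserialize(v) for k, v in item.items()})

        # default=str keeps this sketch simple for Decimal and binary values.
        s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(items, default=str))

    # Hypothetical names; call once after the monthly batch insert completes.
    export_table_to_s3("my-table", "my-export-bucket", "exports/my-table.json")

One way to get the "once per batch" behaviour is to have whatever runs the batch update invoke this export at the end, rather than reacting to individual stream records.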

DynamoDb persist items for a specific time

I have a flow in which we have to persist some items in DynamoDB for a specific time. After the items have expired, we have to call some other services to notify them that the data has expired.
I was thinking about two solutions:
1) Move the expiry check to Java logic:
Retrieve the DynamoDB data in batches, check for expired items in Java, then delete the data in batches and notify the other services.
There are some limitations:
BatchGetItem lets you retrieve at most 100 items.
BatchWriteItem lets you delete at most 25 items.
2) Move the expiry check to the DB logic:
Query DynamoDB to check which items have expired (and delete them), and return the IDs to the client so that we can notify the other services.
Again, there are some limitations:
The result set from a Query is limited to 1 MB per call.
For both solutions, there will be a job that runs periodically, or an AWS Lambda triggered periodically, which calls an endpoint in our app that deletes the items from the DB and notifies the other services.
My question is whether DynamoDB is appropriate for my case, or should I use a relational DB like MySQL that doesn't have these kinds of limitations? What do you think? Thanks!
Have you considered using the DynamoDB TTL feature? This allows you to create a time-based column in your table that DynamoDB will use to automatically delete the items based on the time value.
This requires no implementation on your part and no polling, querying, or batching limitations. You will need to populate a TTL column but you may already have that information present if you are rolling your own expiration logic.
If other services need to be notified when a TTL event occurs, you can create a Lambda that processes the table's DynamoDB stream and takes action when a TTL delete event occurs.
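A hedged sketch of such a stream handler (the notification target is a placeholder, and the stream must be configured to include old images): TTL deletions appear as REMOVE records attributed to the dynamodb.amazonaws.com service principal, which distinguishes them from ordinary deletes.

    import boto3

    sns = boto3.client("sns")

    def lambda_handler(event, context):
        """Process DynamoDB stream records and react to TTL-driven deletions."""
        for record in event["Records"]:
            if record["eventName"] != "REMOVE":
                continue

            # TTL deletions are performed by the DynamoDB service principal.
            identity = record.get("userIdentity", {})
            if identity.get("principalId") != "dynamodb.amazonaws.com":
                continue

            # Requires the stream view type OLD_IMAGE or NEW_AND_OLD_IMAGES.
            expired_item = record["dynamodb"]["OldImage"]
            notify_other_services(expired_item)

    def notify_other_services(item):
        # Placeholder integration: publish to a hypothetical SNS topic.
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:item-expired",
            Message=str(item),
        )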

DynamoDB Triggers (Streams + Lambda) : details on TRIM_HORIZON?

I want to process recent updates on a DynamoDB table and save them in another one. Let's say updates from an IoT device arrive irregularly and are put in Table1, and I need to use the N last updates to compute an update in Table2 for the same device, in sync with the original updates (a kind of sliding window).
DynamoDB Triggers (Streams + Lambda) seem quite appropriate for my needs, but I did not find a clear definition of TRIM_HORIZON. From some docs I understand that it is the oldest data in Table1 (which can get huge), but from other docs it would seem that it is 24h. Or maybe the oldest data in the stream, which is 24h?
So does anyone know the truth about TRIM_HORIZON? Would it even be possible to configure it?
The alternative I see is not to use TRIM_HORIZON, but rather to use LATEST and perform a query on Table1. But that sort of defeats the purpose of streams.
Here are the relevant aspects for you, from DynamoDB's documentation (1 and 2):
"All data in DynamoDB Streams is subject to a 24 hour lifetime. You can retrieve and analyze the last 24 hours of activity for any given table."
"TRIM_HORIZON - Start reading at the last (untrimmed) stream record, which is the oldest record in the shard. In DynamoDB Streams, there is a 24 hour limit on data retention. Stream records whose age exceeds this limit are subject to removal (trimming) from the stream."
So, if you have a Lambda that is continuously processing stream updates, I'd suggest going with LATEST.
Also, since you "need to use the N last updates to compute an update in Table2", you will have to query Table1 for every update anyway, so that you can 'merge' the current update with the previous ones for that device. I don't think you can get around that by using TRIM_HORIZON either.
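To illustrate that merge step (the key schema and N are assumptions), a Lambda on the stream with LATEST could, for each incoming record, query Table1 for the device's N most recent items, newest first:

    import boto3
    from boto3.dynamodb.conditions import Key

    table1 = boto3.resource("dynamodb").Table("Table1")
    N = 10  # size of the sliding window

    def lambda_handler(event, context):
        for record in event["Records"]:
            # Assumes Table1 has partition key "device_id" and a sort key "timestamp".
            device_id = record["dynamodb"]["Keys"]["device_id"]["S"]

            # Fetch the N most recent updates for this device, newest first.
            resp = table1.query(
                KeyConditionExpression=Key("device_id").eq(device_id),
                ScanIndexForward=False,
                Limit=N,
            )
            window = resp["Items"]
            # ... compute the sliding-window aggregate from `window` and write it to Table2 ...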

Deleting Data from DynamoDb Table automatically

Is there any kind of retention period concept in DynamoDB?
I mean, is there any way for the data inside a table to be deleted after some time, like the retention period we can set in S3?
Thanks,
DynamoDB has introduced a Time to Live (TTL) feature. You can create a numeric field and set its value to the time (in seconds since the epoch) at which you want the record to be deleted.
DynamoDB will automatically delete the record at the specified time. Of course, TTL has to be configured on a per-table basis.
You can find more details at: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/time-to-live-ttl-how-to.html
No, there is no "retention" setting available in DynamoDB.
You could run a daily/monthly query that uses a date field to filter results, and use that output to determine which Items to delete. This would need to be implemented in your own application code, for example along the lines of the sketch below.
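A rough sketch of that clean-up (attribute names and the cut-off are assumptions): scan for items older than the cut-off and delete them with the batch writer, which groups the deletes into BatchWriteItem calls.

    import time
    import boto3
    from boto3.dynamodb.conditions import Attr

    table = boto3.resource("dynamodb").Table("my-table")   # hypothetical table name
    cutoff = int(time.time()) - 200 * 24 * 60 * 60         # e.g. anything older than 200 days

    # Scan for expired items; pagination is omitted to keep the sketch short,
    # and for large tables a Query on a date-keyed index is cheaper than a Scan.
    resp = table.scan(FilterExpression=Attr("created_at").lt(cutoff))
    expired = resp["Items"]

    # batch_writer() batches the deletes (25 per BatchWriteItem call) automatically.
    with table.batch_writer() as batch:
        for item in expired:
            batch.delete_item(Key={"id": item["id"]})      # assumes partition key "id"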
Some users choose to use separate tables to provide ageing. For example, create a separate table for each month. Then, delete old tables once they pass a certain age. However, your software would have to know how to handle multiple tables of data.
Examples:
Reference to monthly rotation of tables
Understand Access Patterns for Time Series Data