Finding expired data in AWS DynamoDB

I have a requirement where I need to store some data in DynamoDB with a status and a timestamp, e.g. <START, 20180203073000>.
The status above flips to STOP when I receive a message from SQS. But to make my system error-proof, I need some mechanism to identify when an item with START status in DynamoDB is older than one day and, if so, set its status to STOP, so that it does not wait indefinitely for the SQS message to arrive.
Is there an AWS feature I can use to achieve this without polling the data at regular intervals?

Not sure if this will fit your needs, but here is one possibility:
1) Enable TTL on your DynamoDB table. This works if your timestamp attribute is a Number data type containing the time in epoch format. Once the timestamp expires, the corresponding item is deleted from the table in the background.
2) Enable Streams on your DynamoDB table. Items that are deleted by TTL are sent to the stream.
3) Create a trigger that connects the DynamoDB stream to a Lambda function. In your case, the trigger will receive your entire deleted item.
4) Modify the record (set 'START' to 'STOP'), remove the timestamp attribute (items with no TTL attribute are not deleted), and re-insert it into the table.
This way you avoid table scans searching for expired items, but on the other hand there may be a cost associated with the Lambda executions.
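A minimal sketch of such a Lambda handler, assuming a table named Jobs, a TTL attribute named timestamp, and a stream configured with the OLD_IMAGE (or NEW_AND_OLD_IMAGES) view type; all of these names are placeholders, not part of the original question:

    import boto3

    dynamodb = boto3.client("dynamodb")

    def handler(event, context):
        for record in event["Records"]:
            # TTL expirations arrive as REMOVE events performed by the DynamoDB service.
            if record["eventName"] != "REMOVE":
                continue
            if record.get("userIdentity", {}).get("principalId") != "dynamodb.amazonaws.com":
                continue  # skip deletes made by users or applications

            item = record["dynamodb"]["OldImage"]  # already in DynamoDB-typed JSON
            if item.get("status", {}).get("S") != "START":
                continue

            item["status"] = {"S": "STOP"}  # flip the status
            item.pop("timestamp", None)     # no TTL attribute => not deleted again
            dynamodb.put_item(TableName="Jobs", Item=item)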

You can try creating a GSI with status as the partition key and timestamp as the sort key. When querying for expired items, use a key condition expression like status = "START" AND timestamp < one-day-ago.
Be careful though, because this effectively creates two hot partitions (START and STOP), so make sure the index projection only includes the data you need and no more.
If you have an attribute that is set when status = START but doesn't exist otherwise, you can take advantage of a sparse index: DynamoDB won't index an item in a GSI if the GSI key attributes don't exist on that item, so you don't need to filter those items out at query time.
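A minimal sketch of that query with boto3, assuming the GSI is named status-timestamp-index and the timestamp attribute holds epoch seconds (both assumptions); status and timestamp are DynamoDB reserved words, hence the expression attribute names:

    import time
    import boto3

    table = boto3.resource("dynamodb").Table("Jobs")
    one_day_ago = int(time.time()) - 24 * 60 * 60

    resp = table.query(
        IndexName="status-timestamp-index",
        KeyConditionExpression="#s = :start AND #ts < :cutoff",
        ExpressionAttributeNames={"#s": "status", "#ts": "timestamp"},
        ExpressionAttributeValues={":start": "START", ":cutoff": one_day_ago},
    )
    expired_items = resp["Items"]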

Related

How can I keep only the last n items within DynamoDB?

I have a streaming app that is putting data actively into DynamoDB.
I want to store the last 100 added items and delete the older ones; it seems that the TTL feature will not work in this case.
Any suggestions?
There is no feature within Amazon DynamoDB that enforces keeping only the last n items.
One option is to enforce the 100-item limit within your application, perhaps by storing and maintaining a running counter.
I'd do this via a Lambda function with a trigger on the DynamoDB table in question.
The Lambda would then delete the older entries each time a change is made to the table. You'd need some sort of high-water mark (HWM) for the table items and a way to keep track of it; I'd keep it in a secondary DynamoDB table. Each new item put to the item table would read that HWM, add it as a field on the item, and increment it, essentially implementing an auto-increment field, since those don't exist in DynamoDB. The Lambda function could then delete any items with an auto-increment id of HWM - 100 or less.
There may be better ways, but this would achieve the goal.
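A rough sketch of such a trimming Lambda under simplifying assumptions: instead of a secondary table, the items table itself is keyed by a constant partition key pk plus a numeric seq sort key that plays the role of the auto-increment/HWM field, and the stream uses a NEW_IMAGE view. All names are hypothetical:

    import boto3

    table = boto3.resource("dynamodb").Table("StreamItems")
    KEEP = 100

    def handler(event, context):
        # Take the highest sequence number seen in this batch of stream records.
        seqs = [int(r["dynamodb"]["NewImage"]["seq"]["N"])
                for r in event["Records"] if r["eventName"] == "INSERT"]
        if not seqs:
            return  # only INSERTs matter, so our own deletes don't re-trigger work
        cutoff = max(seqs) - KEEP

        resp = table.query(
            KeyConditionExpression="pk = :p AND seq <= :c",
            ExpressionAttributeValues={":p": "stream", ":c": cutoff},
        )
        with table.batch_writer() as batch:  # batch_writer handles the 25-item limit
            for item in resp["Items"]:
                batch.delete_item(Key={"pk": item["pk"], "seq": item["seq"]})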

Trigger Lambda function based on attribute value in DynamoDB

I have a DynamoDB table whose items have these attributes: id, user, status. Status can take values A or B.
Is it possible to trigger a Lambda based only on the value of the 'status' attribute?
For example, trigger the Lambda when a new item is added to DDB with status == A, or when the status of an existing item is updated to A.
(I am looking into DynamoDB streams for achieving this, but I have not come across an example where anyone is using it for this use case.)
Is it possible to monitor a DDB table based on the value of a certain attribute?
For example, when status == B, I don't want to trigger the Lambda, but only emit a metric for that item. Basically, I want a metric showing how many items in the table have status == B at a given point in time.
If not with DynamoDB, are the above two possible with any other storage type?
Yes, as your initial research has uncovered, this is something you'll want to use DynamoDB Streams for.
You can trigger a Lambda function based on an item being written, updated, or removed from DynamoDB, and you can configure your stream subscription to filter on only the attributes and values you care about.
Lambda recently introduced the ability to filter stream events before invoking your function; you can read more about how that works and how to configure it in the AWS documentation. A sketch follows below.
For more information about DynamoDB Stream use cases, this post may be helpful.
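A sketch of wiring up such a filter with boto3's create_event_source_mapping, so the function is only invoked when the new image has status == "A"; the ARN, function name, and the ellipsis in the stream ARN are placeholders:

    import json
    import boto3

    lambda_client = boto3.client("lambda")

    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/MyTable/stream/...",
        FunctionName="status-a-handler",
        StartingPosition="LATEST",
        FilterCriteria={
            "Filters": [
                # Only invoke the function when the written item has status == "A".
                {"Pattern": json.dumps({"dynamodb": {"NewImage": {"status": {"S": ["A"]}}}})}
            ]
        },
    )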

Persisting items in DynamoDB for a specific time

I have a flow in which we have to persist some items in DynamoDB for a specific time. After the items have expired, we have to call some other services to notify them that the data has expired.
I was thinking about two solutions:
1) Move the expiry check into Java logic:
Retrieve the DynamoDB data in batches, verify item expiry in Java, and after that delete the data in batches and notify the other services.
There are some limitations:
BatchGetItem lets you retrieve at most 100 items.
BatchWriteItem lets you delete at most 25 items.
2) Move the expiry check into the db logic:
Query DynamoDB to check which items have expired (and delete them), and return the ids to the client so we can notify the other services.
Again, there are some limitations:
The result set from a Query is limited to 1 MB per call.
For both solutions there would be a job that runs periodically, or an AWS Lambda triggered periodically that calls an endpoint in our app to delete the items from the db and notify the other services.
My question is whether DynamoDB is appropriate for my case, or should I use a relational db, like MySQL, that doesn't have these kinds of limitations? What do you think? Thanks!
Have you considered using the DynamoDB TTL feature? It lets you designate a time-based attribute in your table that DynamoDB uses to automatically delete items once that time has passed.
This requires no implementation on your part and none of the polling, querying, or batching limitations. You will need to populate the TTL attribute, but you may already have that information present if you are rolling your own expiration logic.
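Enabling TTL is a one-time table configuration call; a minimal sketch with boto3, assuming the expiry attribute is named expires_at (a placeholder):

    import boto3

    boto3.client("dynamodb").update_time_to_live(
        TableName="MyTable",
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
    )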
If other services need to be notified when a TTL event occurs, you can create a Lambda that processes the DynamoDB stream and takes action when a TTL delete event occurs.
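A hedged sketch of such a notification Lambda, assuming downstream services subscribe to an SNS topic (the topic ARN is a placeholder). TTL deletes can be distinguished from ordinary deletes because the DynamoDB service itself performs them:

    import boto3

    sns = boto3.client("sns")

    def handler(event, context):
        for record in event["Records"]:
            # TTL expirations are REMOVE events performed by the DynamoDB service.
            is_ttl_delete = (
                record["eventName"] == "REMOVE"
                and record.get("userIdentity", {}).get("principalId") == "dynamodb.amazonaws.com"
            )
            if is_ttl_delete:
                expired_item = record["dynamodb"]["OldImage"]
                sns.publish(
                    TopicArn="arn:aws:sns:us-east-1:123456789012:item-expired",
                    Message=str(expired_item),  # notify downstream services
                )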

DynamoDB Triggers (Streams + Lambda) : details on TRIM_HORIZON?

I want to process recent updates on a DynamoDB table and save them in another one. Let's say I get irregular updates from an IoT device, put in Table1, and I need to use the N last updates to compute an update in Table2 for the same device, in sync with the original updates (a kind of sliding window).
DynamoDB Triggers (Streams + Lambda) seem quite appropriate for my needs, but I did not find a clear definition of TRIM_HORIZON. Some docs suggest it is the oldest data in Table1 (which can get huge), while other docs suggest it is 24 hours. Or maybe it is the oldest data in the stream, which is 24 hours?
So does anyone know the truth about TRIM_HORIZON? Is it even possible to configure it?
The alternative I see is not to use TRIM_HORIZON, but rather to use LATEST and perform a query on Table1. But that sort of defeats the purpose of streams.
Here are the relevant aspects for you, from DynamoDB's documentation (1 and 2):
"All data in DynamoDB Streams is subject to a 24 hour lifetime. You can retrieve and analyze the last 24 hours of activity for any given table."
"TRIM_HORIZON - Start reading at the last (untrimmed) stream record, which is the oldest record in the shard. In DynamoDB Streams, there is a 24 hour limit on data retention. Stream records whose age exceeds this limit are subject to removal (trimming) from the stream."
In other words, TRIM_HORIZON means the oldest record still in the stream, which is at most 24 hours old; the 24-hour retention is fixed and cannot be configured. So, if you have a Lambda that is continuously processing stream updates, I'd suggest going with LATEST.
Also, since you "need to use the N last updates to compute an update in Table2", you will have to query Table1 for every update anyway, so that you can 'merge' the current update with the previous ones for that device. I don't think you can get around that by using TRIM_HORIZON either.
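A sketch of that LATEST-based approach, assuming Table1 is keyed by device_id (partition key) with a numeric sort key ordering the updates, and with the aggregation itself stubbed out; all names are hypothetical:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table1 = dynamodb.Table("Table1")
    table2 = dynamodb.Table("Table2")
    N = 10  # size of the sliding window

    def handler(event, context):
        for record in event["Records"]:
            if record["eventName"] not in ("INSERT", "MODIFY"):
                continue
            device_id = record["dynamodb"]["Keys"]["device_id"]["S"]

            # Fetch the N most recent updates for this device from Table1.
            resp = table1.query(
                KeyConditionExpression="device_id = :d",
                ExpressionAttributeValues={":d": device_id},
                ScanIndexForward=False,  # newest first, by the sort key
                Limit=N,
            )
            window = resp["Items"]
            # Compute and store the derived value (aggregation left as a stub).
            table2.put_item(Item={"device_id": device_id, "window_size": len(window)})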

Automatically Deleting data from DynamoDB Table and S3

I have started exploring DynamoDB and S3 recently.
I went through the Developer Guide, and I was curious: is there any way to delete data from a DynamoDB table or an S3 bucket based on some condition?
The conditions here may be time or space.
For example, I want to delete data from DynamoDB/S3 after some time (say, 200 days), or if it exceeds some limit, like 2 GB.
For S3, it is possible to delete objects after a given time using lifecycle rules. For DynamoDB, there is no such rule.
For both, there is no direct way to clean up based on the space taken.
In practice, it is possible to do it "quite easily" using AWS Lambda and S3/DynamoDB events, but it requires implementing it yourself: each time a new entry is added (or every n additions), as notified by the S3/DynamoDB events, compute the size and clean up what should be cleaned.
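A minimal sketch of such a time-based S3 lifecycle rule via boto3; the bucket name is a placeholder:

    import boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket="my-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-after-200-days",
                    "Filter": {"Prefix": ""},     # apply to the whole bucket
                    "Status": "Enabled",
                    "Expiration": {"Days": 200},  # delete objects 200 days after creation
                }
            ]
        },
    )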
What you are looking for is the DynamoDB Time to Live (TTL) feature. It is still in its beta phase.
Your DDB table must have a TTL attribute storing the timestamp (in epoch seconds).
"TTL compares the current time in epoch time format to the time stored in the designated TTL attribute of an item. If the epoch time value stored in the attribute is less than the current time, the item is marked as expired and subsequently deleted."
More details can be found in the DDB docs.
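For example, a hedged sketch of writing an item that expires roughly 200 days from now, assuming the table's TTL attribute is named ttl (a hypothetical name):

    import time
    import boto3

    table = boto3.resource("dynamodb").Table("MyTable")
    expires_at = int(time.time()) + 200 * 24 * 60 * 60  # epoch seconds, ~200 days ahead

    table.put_item(Item={"id": "example-item", "ttl": expires_at})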