I am trying to aggregate new records from DynamoDB. My Lambda was working too quickly and firing records back into DynamoDB too fast; my application can only consume 5 records per second. So I am trying to build a stream reader that is called every minute to roll up the stats.
I went through the process of:
ListStreams for table name
DescribeStream
GetShardIterator for each shard using TRIM_HORIZON
GetRecords
Then I recursively process the NextShardIterator until it returns nil. I have now limited this to 5 recursions, as it never seems to end.
Every time I run this I now get the same 16 records, which is not what I want; I only want the records I have not yet processed.
Do I need to use some form of persistence to store the maximum SequenceNumber that I have processed?
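For illustration, a minimal sketch of that checkpointing idea in TypeScript (AWS SDK v3); the stream-checkpoints table and its attribute names are hypothetical. The last processed SequenceNumber per shard is persisted, and the next run resumes with an AFTER_SEQUENCE_NUMBER iterator instead of TRIM_HORIZON:

```typescript
import {
  DynamoDBStreamsClient,
  GetShardIteratorCommand,
  GetRecordsCommand,
} from "@aws-sdk/client-dynamodb-streams";
import { DynamoDBClient, GetItemCommand, PutItemCommand } from "@aws-sdk/client-dynamodb";

const streams = new DynamoDBStreamsClient({});
const ddb = new DynamoDBClient({});
const CHECKPOINT_TABLE = "stream-checkpoints"; // hypothetical checkpoint table

// Read everything after the last checkpointed SequenceNumber for one shard.
async function readNewRecords(streamArn: string, shardId: string) {
  // Look up the last SequenceNumber we processed for this shard (if any).
  const checkpoint = await ddb.send(new GetItemCommand({
    TableName: CHECKPOINT_TABLE,
    Key: { shardId: { S: shardId } },
  }));
  const lastSeq = checkpoint.Item?.lastSequenceNumber?.S;

  // Resume after the checkpoint; fall back to TRIM_HORIZON on the first run.
  const iter = await streams.send(new GetShardIteratorCommand({
    StreamArn: streamArn,
    ShardId: shardId,
    ShardIteratorType: lastSeq ? "AFTER_SEQUENCE_NUMBER" : "TRIM_HORIZON",
    SequenceNumber: lastSeq,
  }));

  let shardIterator = iter.ShardIterator;
  let newLastSeq = lastSeq;
  while (shardIterator) {
    const page = await streams.send(new GetRecordsCommand({ ShardIterator: shardIterator }));
    for (const record of page.Records ?? []) {
      // ...roll up the stats here...
      newLastSeq = record.dynamodb?.SequenceNumber ?? newLastSeq;
    }
    // An open shard keeps returning a NextShardIterator even when Records is
    // empty, so stop on the first empty page instead of waiting for null.
    if (!page.Records?.length) break;
    shardIterator = page.NextShardIterator;
  }

  // Persist the checkpoint so the next run skips already-processed records.
  if (newLastSeq && newLastSeq !== lastSeq) {
    await ddb.send(new PutItemCommand({
      TableName: CHECKPOINT_TABLE,
      Item: { shardId: { S: shardId }, lastSequenceNumber: { S: newLastSeq } },
    }));
  }
}
```

The empty-page check would also explain why the recursion never seems to end: for an open shard, GetRecords returns a NextShardIterator even when there are no more records to read.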
Related
I store sensor data (power and voltage measurements) coming from different devices in DynamoDB (partition key: deviceid and sort key: timestamp).
Every hour, I would like to access the data that has been stored in that time frame, do some calculations with it and store the results elsewhere.
My initial idea was to run a Lambda that would be triggered by a CloudWatch Rule, do the calculations and store the results in another DynamoDB table.
I saw a similar question and the answer suggested DynamoDB Streams instead. But aren't Streams supposed to be triggered every time an item is inserted/updated/deleted? I understand there are conditions for invoking the Lambda from the Streams service that could allow me to do so every hour, but I don't think this is the most efficient way around the problem.
So my question is: is there an efficient way/service to accomplish this?
DynamoDB Streams can process this data efficiently using batch windows and tumbling windows, both of which are built into the DynamoDB Streams event source mapping for Lambda. A batch window ensures the Lambda function is invoked at most once every 5 minutes (or earlier, once the batch reaches 6 MB of data), and tumbling windows let you maintain a running aggregate over windows of up to 15 minutes.
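As a sketch of what a tumbling-window handler could look like (TypeScript; the extra event fields follow the tumbling-window event format, while the power attribute and the logging are assumptions of mine):

```typescript
import type { DynamoDBStreamEvent } from "aws-lambda";

// Tumbling-window invocations carry extra fields on top of the plain stream
// event (per the tumbling-window event format in the Lambda documentation).
interface WindowedEvent extends DynamoDBStreamEvent {
  window: { start: string; end: string };
  state: Record<string, string>;
  isFinalInvokeForWindow: boolean;
}

// Requires the stream view type NEW_IMAGE (or NEW_AND_OLD_IMAGES).
export const handler = async (event: WindowedEvent) => {
  // Carry the running total across invocations within the same window.
  let total = Number(event.state?.total ?? 0);

  for (const record of event.Records) {
    const power = record.dynamodb?.NewImage?.power?.N; // hypothetical attribute
    if (power) total += Number(power);
  }

  if (event.isFinalInvokeForWindow) {
    // Window closed: persist the aggregate (another table, S3, ...).
    console.log(`window ${event.window.start}..${event.window.end} total=${total}`);
    return;
  }
  // Not the final invoke: return the state so Lambda hands it to the next one.
  return { state: { total: String(total) } };
};
```

The window sizes themselves are set on the event source mapping, e.g. with --maximum-batching-window-in-seconds 300 and --tumbling-window-in-seconds 900 in the AWS CLI.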
I have the following use-case:
Users respond to a poll with a thumbs up/down vote. The poll will last for only a few seconds. Expecting around 30K users to respond to the poll in 10-15 seconds. The aggregated count should be displayed in real-time. Speed is more important than consistency of data.
Approach#1
Store each user's response as a separate record in a DynamoDB table. Turn on streams for this table. A Lambda can aggregate the votes and store the count in a separate record, which is updated on each Lambda invocation.
Pros: With a batch size of 1, real time can be achieved.
Cons: For 30K users, 30K Lambda invocations will be required. Updating the aggregate in the same record might lead to a hot-partition issue in DynamoDB.
Approach#2
Stream the data using Kinesis Data Streams. Batch-process the data (every 1 second) using a Lambda. This Lambda aggregates the records and updates a count record in DynamoDB.
Pros: Fewer Lambda invocations. No hot-partition issue in DynamoDB.
Cons: With batch processing, absolute real time cannot be achieved. Users can respond to the poll multiple times. UserId-level data is not persisted.
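For reference, the aggregation step of Approach#2 as I picture it (a sketch; the table, key, and payload shape are made up):

```typescript
import type { KinesisStreamEvent } from "aws-lambda";
import { DynamoDBClient, UpdateItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

export const handler = async (event: KinesisStreamEvent) => {
  // Aggregate the whole batch in memory first...
  let up = 0;
  let down = 0;
  for (const record of event.Records) {
    // Kinesis delivers the payload base64-encoded.
    const vote = JSON.parse(
      Buffer.from(record.kinesis.data, "base64").toString("utf8")
    );
    if (vote.thumbsUp) up++; else down++;
  }

  // ...then issue a single atomic counter update per batch, so 30K votes
  // become a handful of writes to the count record instead of 30K.
  await ddb.send(new UpdateItemCommand({
    TableName: "poll-results",            // hypothetical table
    Key: { pollId: { S: "poll-123" } },   // hypothetical key
    UpdateExpression: "ADD upVotes :u, downVotes :d",
    ExpressionAttributeValues: {
      ":u": { N: String(up) },
      ":d": { N: String(down) },
    },
  }));
};
```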
I need suggestions on the above approaches, or any alternative approach, to achieve real-time aggregation for this use-case.
I have to set up a monitoring process on AWS.
To keep things simple: I have some clients that send me a heartbeat, let's say every 5 minutes, via SOAP requests to my SOAP server deployed as an Elastic Beanstalk NodeJS app. Every time I receive a heartbeat, I store the time I received it in a DynamoDB table by updating a field on the client's row.
I now need to create a process that, if I haven't received a heartbeat in the last 30 minutes, does stuff (updates other tables, calls Lambda functions, etc.). I don't know yet how many clients I will have, but the number will potentially grow over time, and they will be connected to my server 24/7.
I was hoping for something like an event that triggers a Lambda function (or posts a message to an SNS topic) once a specific row in the table has not been updated for 30 minutes, but I don't know how to get this last part to work. This check should cover every row in the table.
How would you do it?
Thank you!
You can use DynamoDB with TTL, DynamoDB Streams and AWS Lambda for this. No need for cron.
When you create a new row or update an existing row, set that row's TTL to 30 minutes in the future.
When the TTL is reached and DynamoDB deletes the item, the deletion shows up in the DynamoDB stream, which you can use as a trigger for a Lambda function. (Note that TTL deletions run in the background and can lag the expiry time, so treat the 30 minutes as a lower bound rather than an exact deadline.)
This Lambda function can then do the custom processing that you want (e.g. update other tables, call Lambda functions, etc.).
Take note that the original DynamoDB row will be deleted when its TTL expires. If you need to keep that record, you can have the Lambda function recreate it and set a new TTL another 30 minutes in the future.
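A sketch of both halves of this approach (TypeScript; the heartbeats table, its clientId key, and the expiresAt TTL attribute are names I made up):

```typescript
import type { DynamoDBStreamHandler } from "aws-lambda";
import { DynamoDBClient, UpdateItemCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

// On every heartbeat: refresh the row and push its TTL 30 minutes out.
export async function recordHeartbeat(clientId: string) {
  const expiresAt = Math.floor(Date.now() / 1000) + 30 * 60; // epoch seconds
  await ddb.send(new UpdateItemCommand({
    TableName: "heartbeats",
    Key: { clientId: { S: clientId } },
    UpdateExpression: "SET expiresAt = :t", // the table's configured TTL attribute
    ExpressionAttributeValues: { ":t": { N: String(expiresAt) } },
  }));
}

// Stream handler: react only to deletions performed by the TTL service.
// (The stream needs OLD_IMAGE so the deleted row's data is available.)
export const handler: DynamoDBStreamHandler = async (event) => {
  for (const record of event.Records) {
    const isTtlDelete =
      record.eventName === "REMOVE" &&
      record.userIdentity?.principalId === "dynamodb.amazonaws.com";
    if (!isTtlDelete) continue;

    const clientId = record.dynamodb?.OldImage?.clientId?.S;
    // ...client has been silent for 30+ minutes: update other tables, notify, etc.
    console.log(`missed heartbeat from ${clientId}`);
  }
};
```

The userIdentity check is what distinguishes TTL deletions from ordinary deletes in the stream.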
References:
DynamoDB Streams and Time To Live
DynamoDB Streams and AWS Lambda Triggers
Do you need to check on a 30-minute interval across all clients? That can be done by a cron job on a server that executes the equivalent of an SQL statement WHERE now() - timestamp > 30 minutes (just stating the logic; for DynamoDB only the syntax changes).
Or do you need to check 30 minutes after the last time a particular client contacted you? In that case your cron would run every minute with the same logic as above.
If you need to run a Lambda on a schedule, refer to this link: Using Lambda in schedule.
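For example, the scheduled check could be a Lambda that scans for stale rows (a sketch; the heartbeats table and its lastSeen epoch-seconds attribute are assumed names):

```typescript
import { DynamoDBClient, ScanCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});

// Run on a schedule (e.g. every minute via a CloudWatch Events rule):
// find clients whose last heartbeat is more than 30 minutes old.
export const handler = async () => {
  const cutoff = Math.floor(Date.now() / 1000) - 30 * 60;
  const res = await ddb.send(new ScanCommand({
    TableName: "heartbeats",
    FilterExpression: "lastSeen < :cutoff",
    ExpressionAttributeValues: { ":cutoff": { N: String(cutoff) } },
  })); // pagination omitted for brevity
  for (const item of res.Items ?? []) {
    // ...update other tables, call Lambda functions, etc.
    console.log(`stale client: ${item.clientId?.S}`);
  }
};
```

Keep in mind a Scan reads the whole table on every run, which is the cost of this approach as the client count grows.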
Hope this helps. If you need further help with the syntax, I'm ready to help.
I have several related questions here.
DynamoDB documentation on streams says:
A shard might split in response to high levels of write activity on its parent table, so that applications can process records from multiple shards in parallel.
My understanding is that when a shard splits into two child shards, DynamoDB stops writing to the parent shard and starts writing to both of the child shards in a round-robin fashion. How, in this case, can I establish the chronological order of records? Do I have to read both child shards and sort the records by sequence number in the application layer? What if the second child at some point splits into two grandchild shards? Do I now have to read both child and grandchild shards before getting the records in order?
The aforementioned documentation says:
Because shards have a lineage (parent and children), applications must always process a parent shard before it processes a child shard.
If you take a look at the Low-Level DynamoDB Streams API example provided in the documentation, under the // Get the shards in the stream comment you'll notice that the code simply gets all shards for a given stream and then iterates over the list of shards without bothering with parent-child relationships.
Does this mean that if I want to get a list of records in chronological order, I have to read ALL records from a given stream and then sort them by sequence number in the application layer?
Is trying to get chronological record order out of a DynamoDB stream a bad idea altogether? Please don't ask me about a concrete problem I am trying to solve; I am theorizing here.
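To make the lineage part concrete, the ordering I have in mind would look something like this (a sketch over the shard list returned by DescribeStream; pagination is ignored for brevity):

```typescript
import {
  DynamoDBStreamsClient,
  DescribeStreamCommand,
  Shard,
} from "@aws-sdk/client-dynamodb-streams";

const streams = new DynamoDBStreamsClient({});

// Order shards so every parent comes before its children (a topological
// sort over the ParentShardId links). Within one shard, records are already
// ordered by sequence number; across sibling shards, records would still
// have to be merged by SequenceNumber in the application layer.
async function shardsInLineageOrder(streamArn: string): Promise<Shard[]> {
  const { StreamDescription } = await streams.send(
    new DescribeStreamCommand({ StreamArn: streamArn })
  );
  const shards = StreamDescription?.Shards ?? [];

  const byId = new Map<string, Shard>();
  for (const s of shards) byId.set(s.ShardId!, s);

  const ordered: Shard[] = [];
  const visited = new Set<string>();
  const visit = (shard: Shard): void => {
    if (visited.has(shard.ShardId!)) return;
    visited.add(shard.ShardId!);
    // Process the parent first, if it is still present in the stream.
    const parent = shard.ParentShardId ? byId.get(shard.ParentShardId) : undefined;
    if (parent) visit(parent);
    ordered.push(shard);
  };
  shards.forEach(visit);
  return ordered;
}
```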
UPDATE:
The above questions came to me when I was thinking about processing the past 24 hours of stream records. But why would one want to process the past 24 hours of stream data in the first place?
I think that streams are built first and foremost for real-time table change processing, and processing stream records in real time by triggering a Lambda function makes more sense.
The only use case for going through the past 24 hours of stream records that comes to my mind is some kind of stream record processing failure recovery (for a failure that was detected very quickly).
Bonus question:
Can you think of use cases when you would want to dig through past 24 hours of a DynamoDB stream?
I want to process recent updates on a DynamoDB table and save them in another one. Let's say I get updates from an IoT device irregularly put in Table1, and I need to use the N last updates to compute an update in Table2 for the same device in sync with the original updates (kind of a sliding window).
DynamoDB Triggers (Streams + Lambda) seem quite appropriate for my needs, but I did not find a clear definition of TRIM_HORIZON. From some docs I understand that it is the oldest data in Table1 (which can get huge), but from other docs it would seem that it is 24h. Or maybe the oldest in the stream, which is 24h?
So does anyone know the truth about TRIM_HORIZON? Would it even be possible to configure it?
The alternative I see is not to use TRIM_HORIZON, but rather to use LATEST and perform a query on Table1. But that sort of defeats the purpose of streams.
Here are the relevant aspects for you, from DynamoDB's documentation (1 and 2):
All data in DynamoDB Streams is subject to a 24 hour lifetime. You can retrieve and analyze the last 24 hours of activity for any given table.

TRIM_HORIZON - Start reading at the last (untrimmed) stream record, which is the oldest record in the shard. In DynamoDB Streams, there is a 24 hour limit on data retention. Stream records whose age exceeds this limit are subject to removal (trimming) from the stream.
So, if you have a Lambda that is continuously processing stream updates, I'd suggest going with LATEST.
Also, since you "need to use the N last updates to compute an update in Table2", you will have to query Table1 for every update anyway, so that you can 'merge' the current update with the previous ones for that device. I don't think you could get around that by using TRIM_HORIZON either.
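For completeness, a sketch of that LATEST-plus-query pattern (TypeScript; it assumes Table1 is keyed by deviceid with a timestamp sort key, and N is a made-up window size):

```typescript
import type { DynamoDBStreamHandler } from "aws-lambda";
import { DynamoDBClient, QueryCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({});
const N = 10; // hypothetical sliding-window size

// Event source mapping configured with starting position LATEST.
export const handler: DynamoDBStreamHandler = async (event) => {
  for (const record of event.Records) {
    const deviceid = record.dynamodb?.Keys?.deviceid?.S;
    if (!deviceid) continue;

    // Fetch the N most recent updates for this device from Table1
    // (newest first via ScanIndexForward: false on the sort key).
    const recent = await ddb.send(new QueryCommand({
      TableName: "Table1",
      KeyConditionExpression: "deviceid = :d",
      ExpressionAttributeValues: { ":d": { S: deviceid } },
      ScanIndexForward: false,
      Limit: N,
    }));

    // ...compute the sliding-window result and write it to Table2...
    console.log(`device ${deviceid}: ${recent.Items?.length ?? 0} recent updates`);
  }
};
```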