Real-time data aggregation - DynamoDB Streams vs Kinesis Data Streams

I have the following use-case:
Users respond to a poll with a thumbs up/down vote. The poll will last for only a few seconds. Expecting around 30K users to respond to the poll in 10-15 seconds. The aggregated count should be displayed in real-time. Speed is more important than consistency of data.
Approach #1
Store each user's vote as a separate record in a DynamoDB table. Turn on streams for this table. A Lambda function can aggregate the votes and store the count in a separate record, which is updated on each invocation (see the sketch below).
Pros: With a batch size of 1, near real-time updates can be achieved.
Cons: For 30K users, 30K Lambda invocations will be required. Updating the aggregated count in the same record might lead to a hot-partition issue in DynamoDB.
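A minimal sketch of what the Approach #1 aggregation Lambda could look like, assuming a hypothetical counter item keyed by pk = "POLL#current" and a vote attribute on each inserted record (the table, key, and attribute names are illustrative, not from the question):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table and key names, for illustration only.
counter_table = dynamodb.Table("PollAggregates")


def handler(event, context):
    """Triggered by the DynamoDB stream of the votes table."""
    up = down = 0
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # only count newly written votes
        vote = record["dynamodb"]["NewImage"]["vote"]["S"]
        if vote == "UP":
            up += 1
        else:
            down += 1

    if up or down:
        # Atomic ADD avoids read-modify-write races between concurrent invocations.
        counter_table.update_item(
            Key={"pk": "POLL#current"},
            UpdateExpression="ADD upVotes :u, downVotes :d",
            ExpressionAttributeValues={":u": up, ":d": down},
        )
```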
Approach #2
Stream the votes through Kinesis Data Streams. Batch-process the data (every 1 second) with a Lambda function that aggregates the records and updates a count record in DynamoDB (see the sketch below).
Pros: Fewer Lambda invocations. No hot-partition issue in DynamoDB.
Cons: With batch processing, absolute real time cannot be achieved. UserId-level data is not persisted, so users can respond to the poll multiple times without being detected.
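And a corresponding sketch for Approach #2, where the Lambda receives a batch of base64-encoded Kinesis records and issues a single counter update per batch (again, the table and attribute names are assumptions):

```python
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
counter_table = dynamodb.Table("PollAggregates")  # hypothetical table name


def handler(event, context):
    """Triggered by a Kinesis Data Streams event source mapping."""
    up = down = 0
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("vote") == "UP":
            up += 1
        else:
            down += 1

    # One DynamoDB write per batch instead of one per vote.
    counter_table.update_item(
        Key={"pk": "POLL#current"},
        UpdateExpression="ADD upVotes :u, downVotes :d",
        ExpressionAttributeValues={":u": up, ":d": down},
    )
```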
Need suggestions on the above approaches, or any alternative approach, to achieve real-time aggregation for this use case.

Related

DynamoDB Streams: small number of items per batch

I have a very large DynamoDB table, and I want to use a Lambda function triggered by its stream. I would like to work in big batches of at least 1000 items, but when I connect the Lambda, I see it is invoked with tiny batches of 1 or 2 items. I increased the batch window to 15 seconds, and it doesn't help.
I assume it's because the table has a lot of shards, and every batch gathers items from only one shard. Is this correct?
What can be done in order to increase the batch size?
I wrote a deep-dive blog post about the integration of DynamoDB Streams and Lambda (disclaimer: written by me on the company blog, but very relevant to the question).
DynamoDB Streams consist of shards that store a record of changes sequentially. Each storage partition in the table maps to at least one shard of a DynamoDB stream. The shards get split if a shard is full or the throughput is too high.
Conceptually, the Lambda service polls each stream shard for new records. Crucially, polling the shards happens in parallel, but batching is always per shard in order to maintain the order of changes and have consistent scale-out behavior.
The configuration options on the event source mapping (batch size and batch window) determine how that per-shard processing happens.
Let's focus on your situation. If you have a large number of items, and relatively high throughput, chances are that DynamoDB allocates many storage partitions to handle that throughput. That automatically leads to a large number of stream shards (#shards >= #storage_partitions).
If your changes are well distributed over the table (which is what you want in order to distribute the load evenly), this means there aren't many changes written to any single shard at any point in time. So for a batch window of a few seconds (15 in your case), the actual batch size may be low. If the changes are focused on some partitions, you should see a relatively high variance in the batch size (unfortunately, there's no metric for it, as far as I know).
The only thing you can control directly here (without larger architectural changes) is the batch window. If you increase that, you should see larger batch sizes at the expense of higher processing latency.
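Both knobs live on the event source mapping. A small sketch of raising them with boto3; the UUID is a placeholder for your mapping's ID, which you can look up with list_event_source_mappings:

```python
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_event_source_mapping(
    UUID="11111111-2222-3333-4444-555555555555",  # placeholder mapping ID
    BatchSize=1000,                      # an upper bound, not a guarantee
    MaximumBatchingWindowInSeconds=300,  # wait up to 5 minutes per shard
)
```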
You could consider having a Lambda function write these changes to a Kinesis Data Firehose delivery stream, configure it to write records in batches to S3, and have another Lambda respond to objects written to S3. This would increase your latency again, but allows for much larger batch sizes (a sketch follows below).
(I also considered writing to SQS, but the max batch size you can request from there is 10.)
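If you go the Firehose route, the forwarding Lambda could be a thin pass-through like the sketch below; the delivery stream name is a placeholder, and Firehose then takes care of batching the records into S3 objects:

```python
import json

import boto3

firehose = boto3.client("firehose")


def handler(event, context):
    """Forwards DynamoDB stream records to a Kinesis Data Firehose delivery stream."""
    records = [
        {"Data": (json.dumps(record["dynamodb"]) + "\n").encode("utf-8")}
        for record in event["Records"]
    ]
    if records:
        # PutRecordBatch accepts up to 500 records per call; stream batches
        # from DynamoDB are typically far smaller than that.
        firehose.put_record_batch(
            DeliveryStreamName="changes-to-s3",  # placeholder stream name
            Records=records,
        )
```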

Is there an efficient solution to access time-series data from DynamoDB every hour?

I store sensor data (power and voltage measurements) coming from different devices in DynamoDB (partition key: deviceid and sort key: timestamp).
Every hour, I would like to access the data that has been stored in that time frame, do some calculations with it and store the results elsewhere.
My initial idea was to run a Lambda that would be triggered by a CloudWatch Rule, do the calculations and store the results in another DynamoDB table.
I saw a similar question and the answer suggested DynamoDB Streams instead. But aren't Streams supposed to be triggered every time an item is updated/deleted/inserted? I understand there are conditions to invoke the Lambda with the Streams service that could allow me to do so every hour, but I don't think this is the most efficient way to get around the problem.
So my question is whether there is an efficient way/service to accomplish this?
DynamoDB Streams can process this data efficiently using batch windows and tumbling windows, both of which are built into the DynamoDB Streams/Lambda integration. A batch window can ensure the Lambda function is only invoked once every 5 minutes or once 6 MB of data has accumulated, and tumbling windows allow you to perform running calculations over windows of up to 15 minutes.
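A rough sketch of a tumbling-window handler for this kind of running calculation; the state returned from one invocation is passed back in on the next invocation within the same window. The results table, the power attribute, and the averaging logic are assumptions based on the question, not a prescribed solution:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
results_table = dynamodb.Table("HourlyAggregates")  # hypothetical results table


def handler(event, context):
    """Tumbling-window handler for a DynamoDB Streams event source mapping."""
    state = event.get("state") or {"sum_power": 0.0, "samples": 0}

    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        new_image = record["dynamodb"]["NewImage"]
        # 'power' is an assumed attribute name; DynamoDB numbers arrive as strings.
        state["sum_power"] += float(new_image["power"]["N"])
        state["samples"] += 1

    if event.get("isFinalInvokeForWindow"):
        # The window (up to 15 minutes) is closing: persist the aggregate.
        results_table.put_item(Item={
            "windowStart": event["window"]["start"],
            "avgPower": str(state["sum_power"] / max(state["samples"], 1)),
        })
        return {}

    # Carry the partial aggregate into the next invocation of this window.
    return {"state": state}
```

Since windows max out at 15 minutes, an hourly figure would still be rolled up from several window results (or by a small scheduled job on top of the aggregate table).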

Can I aggregate data from a stream on AWS?

I have data coming from multiple machines, I would like to aggregate it by user. I'm thinking of producing batches of 1000 "rows", or 10 seconds of data (whichever comes first), by user.
I do have some experience with AWS Kinesis and Lambda, but in my experience we don't have much control over how the aggregation is done. All machines would send the data through Kinesis, with the user id as the partition key. Then AWS will call our Lambda with small batches. This has been great for some other use cases, but here, if I receive 100 records, I don't know what to do (I would like to "wait" to receive more, or wait until 10 seconds have elapsed since the timestamp of the first record).
Also, I'm not sure how the aggregation "by user id" would work. So far, in a Lambda, I would have split the records in the batch by user id, but if I get called with a batch of 100 records, even though the partition key is the user id, there is no guarantee that those 100 records belong to one user. Maybe I will get 100 records from 100 different users, and there is no "aggregation" help at all.
Any idea if Kinesis + Lambda is suited for this? I did look at the AWS documentation but I don't see my scenario. It looks like they also have a tool "Data Streams", but it's hard for me to understand whether this would work for my case.
Thanks!
Your understanding is correct. AWS Lambda + Kinesis alone will not be sufficient for aggregation. The AWS Lambda programming model is stateless, so you can only aggregate over the batch of records received in that particular invocation (a single GetRecords API call). Furthermore, the batch size configured on the function does not guarantee that you will get that number of records; it is merely the maximum number of records (MaxRecords) you can get per invocation.
What you need is some kind of windowing mechanism, either row-based or time-based. Kinesis Data Analytics would be the easiest and fastest way to get on-boarded with this. You can use either SQL or Flink with Kinesis Data Analytics, and you can even send the output to AWS Lambda for post-processing.
Another way would be to run a Spark Streaming job (for example on AWS EMR) and use windowing in your application.
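To illustrate the windowing idea from the Spark option, a minimal Structured Streaming sketch; it uses the built-in rate source as a stand-in for a Kinesis connector (which depends on your EMR setup), and the user_id column is fabricated for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, window

spark = SparkSession.builder.appName("per-user-windows").getOrCreate()

# Stand-in source; in practice you would read from Kinesis via a connector on EMR.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 100).load()
    .withColumn("user_id", expr("concat('user-', cast(value % 10 as string))"))
)

# 10-second tumbling windows per user; the watermark bounds how long late data is kept.
counts = (
    events
    .withWatermark("timestamp", "30 seconds")
    .groupBy(window(col("timestamp"), "10 seconds"), col("user_id"))
    .count()
)

query = counts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```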

AWS: Execute a task after 1 year has elapsed

Basically, I have a web service that receives a small JSON payload (an event) a few times per minute, say 60. This event must be sent to an SQS queue only after 1 year has elapsed (it's OK to have it happen a few hours sooner or later, but the day of the month should be exactly the same).
This means I'll have to store more than 31 million events somewhere before the first one should be sent to the SQS queue.
I thought about using SQS message timers, but they have a limit of only 15 minutes, and as pointed out by @Charlie Fish, it's weird to have an element lurking around on a queue for such a long time.
A better possibility could be to schedule a Lambda function using a cron expression for each event (I could end up with millions or billions of scheduled Lambda functions in a year, if I don't hit an AWS limit well before that).
Or I could store these events on DynamoDB or RDS.
What would be the recommended / most cost-effective way to handle this using AWS services? Scheduled lambda functions? DynamoDB? PostgreSQL on RDS? Or something entirely different?
And what if I have 31 billion events per year instead of 31 million?
I cannot afford to lose ANY of those events.
DynamoDB is a reasonable option, as is RDS; SQS for long-term storage is not a good choice. However, if you want to keep your costs down, I might suggest another approach: accumulate the events for a single 24-hour period (or a smaller interval if that is desirable), and write that set of data out as an S3 object instead of keeping it in DynamoDB. You could employ DynamoDB or RDS (or just about anything else) as the place to accumulate events for the day (or hour) before writing that data out to S3 as a single object for the interval.
Each S3 object could be named appropriately, either indicating the date/time it was created or the date/time it needs to be used, e.g. 20190317-1400 to indicate that on March 17th, 2019 at 2 PM this file needs to be used.
I would imagine a Lambda function, called by a CloudWatch Events rule that fires every 60 minutes, that scans your S3 bucket looking for files that are due to be used, reads in the JSON data, puts the events into an SQS queue for further processing, and moves the processed S3 object to another 'already processed' bucket (see the sketch below).
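A sketch of that hourly Lambda under the naming scheme above; the bucket and queue names are placeholders, and it assumes each object holds a JSON array of events:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "delayed-events"  # placeholder bucket name
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/due-events"  # placeholder


def handler(event, context):
    """Runs hourly; releases events whose key matches the current hour."""
    # Keys follow the YYYYMMDD-HHMM scheme described above, e.g. 20190317-1400.
    due_prefix = datetime.now(timezone.utc).strftime("%Y%m%d-%H00")
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=due_prefix)

    for obj in listing.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        for item in json.loads(body):
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(item))

        # Move the processed object to an 'already processed' bucket.
        s3.copy_object(
            Bucket=BUCKET + "-processed",
            Key=obj["Key"],
            CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
        )
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```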
Your storage costs would be minimal (especially if you batch events up by day or hour), S3 has 11 9's of durability, and you can archive older events off to Glacier if you want to keep them around even after they are processed.
DynamoDB is a great product; it provides redundant storage and super high performance, but I see nothing in your requirements that would warrant incurring that cost or requiring the performance of DynamoDB. And why keep millions of records in an 'always on' database when you know in advance that you don't need to use or see the records until a year from now?
I mean, you could store some form of the data in DynamoDB and run a daily Lambda task to query for all the items that are more than a year old, remove those from DynamoDB, and import them into SQS (roughly as sketched below).
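A rough sketch of that daily task, assuming a hypothetical table with an id key, a storedAt timestamp, and the original payload as a string attribute; a GSI on the timestamp would be cheaper than the scan shown here, and pagination is omitted for brevity:

```python
from datetime import datetime, timedelta, timezone

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

events_table = dynamodb.Table("DelayedEvents")  # hypothetical table
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/due-events"  # placeholder


def handler(event, context):
    """Daily job: move events stored more than a year ago from DynamoDB to SQS."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=365)).isoformat()
    resp = events_table.scan(FilterExpression=Attr("storedAt").lte(cutoff))

    for item in resp.get("Items", []):
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=item["payload"])
        events_table.delete_item(Key={"id": item["id"]})
```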
As you mentioned, SQS doesn't have this functionality built in, so you need to store the data using some other technology. DynamoDB seems like a reasonable choice based on what you have described above.
Of course, you also have to think about whether doing a cron task once per day is sufficient for your needs. Do you need it to be exactly after 1 year? Is it acceptable to have it be one year and a few days? Or one year and a few weeks? What is the window that is acceptable for importing into SQS?
Finally, the other question you have to think about is if SQS is even reasonable for your application. Having a queue that has a 1 year delay seems kinda strange. I could be wrong, but you might want to consider something besides SQS because SQS is meant for much more instantaneous tasks. See the examples on this page (Decouple live user requests from intensive background work: let users upload media while resizing or encoding it, Allocate tasks to multiple worker nodes: process a high number of credit card validation requests, etc.). None of those examples are really meant for a year of wait time before executing. At the end of the day it depends on your use case, but off the top of my head I can't think of a situation that makes sense for delaying entry into an SQS queue for a year. There seem to be much better ways to handle this, but again I don't know your specific use case.
EDIT: Another question is whether your workload is consistent. Is the amount of data you need to store consistent? How about the format? What about the number of events per second? You mention that you don't want to lose any data, so for sure build in error handling and backup systems. But DynamoDB doesn't scale well if one moment you store 5 items and the next moment you want to store 5 million items. If you set your capacity to account for 5 million, then it is fine. But the question is whether the amount of data and its frequency will be consistent or not.

Amazon Web Services (AWS) - Aggregating DynamoDB Data

We have a DynamoDB table that stores machine sensor information with the following structure:
HashKey: MachineNumber (Number)
SortKey: EntryDate (String)
Columns: SensorType (String), SensorValue (Number)
The sensors generate information roughly every 3 seconds, and we're looking to measure a (near) real-time KPI: how many machines in a region were down for more than 10 minutes in the past hour. A region can have close to 10,000 machines, so iterating through DynamoDB takes 10+ minutes to return a response. What is the best way to do this?
This answer describes the solution discussed in the comments on the question.
Performing a table scan on a very large table is expensive and should be avoided. DynamoDB Streams provides the ability to process records using your own custom code after they are inserted. This allows for aggregations or other computations to be performed asynchronously in near real time. The result can then be written or updated in a separate DynamoDB table.
You can run the code that processes the DynamoDB Stream messages on your own server (for example, EC2), but it is likely easier to just use Lambda. Lambda lets you write code (for example, in Java or Node.js) that runs on fully managed AWS infrastructure, so all you need to worry about is the code (see the sketch below).
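A rough illustration of that pattern for this KPI: a stream-triggered Lambda that keeps one compact status item per machine, so the hourly calculation only has to query a small aggregate table per region instead of scanning the raw sensor data. The aggregate table name and the availability of a Region attribute on each sensor record are assumptions:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical aggregate table: one small item per machine, keyed by region.
status_table = dynamodb.Table("MachineStatusByRegion")


def handler(event, context):
    """Triggered by the sensor table's stream; maintains per-machine status."""
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        image = record["dynamodb"]["NewImage"]
        status_table.update_item(
            Key={
                "region": image["Region"]["S"],           # assumed attribute
                "machine": int(image["MachineNumber"]["N"]),
            },
            UpdateExpression="SET lastSeen = :ts, lastValue = :v",
            ExpressionAttributeValues={
                ":ts": image["EntryDate"]["S"],
                ":v": image["SensorValue"]["N"],
            },
        )
    # The hourly KPI can then be computed by querying this small table per region.
```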