Kinesis Analytics Destination Guidance: Lambda vs Kinesis Stream to Lambda - amazon-web-services

After Kinesis Analytics does it's job, the next step is to send that information off to a destination. AWS currently offers 3 destination choices:
Kinesis stream
Kinesis Firehose delivery stream
AWS Lambda function
For my use case, Kinesis Firehose delivery stream is not what I want so I am left with:
Kinesis stream
AWS Lambda function
If I set the destination to a Kinesis Stream, I would then attach a Lambda to that stream to process the records.
AWS also offers the ability to set the destination to a Lambda, bypassing the Kinesis Stream step of this process. In doing some digging for docs I found this:
Using a Lambda Function as Output
Specifically in those docs under Lambda Output Invocation Frequency it says:
If records are emitted to the destination in-application stream within the data analytics application as a continuous query or a sliding window, the AWS Lambda destination function is invoked approximately once per second.
My Kinesis Analytics output qualifies under this scenario. So I can assume that my Lambda will be invoked, "approximately once per second".
I'm trying to understand the difference between using these 2 destinations as it pertains to using a Lambda.
Using AWS Lambda with Kinesis states that:
You can subscribe Lambda functions to automatically read batches of records off your Kinesis stream and process them if records are detected on the stream. AWS Lambda then polls the stream periodically (once per second) for new records.
So it sounds like the the invocation interval is the same in either case; approximately 1 second.
So I think the guidence is:
If the next stage in the pipeline only needs one consumer, then use the AWS Lambda function destination. If however, you need to use multiple different consumers to do different things for the same data sent to the destination, the a Kinesis Stream is more appropriate.
Is this a correct assumption on how to choose a destination? Again, for my use case I am excluding the Kinesis Firehose delivery stream.

If the next stage in the pipeline only needs one consumer, then use the AWS Lambda function destination. If however, you need to use multiple different consumers to do different things for the same data sent to the destination, the a Kinesis Stream is more appropriate.
• I would always use Kinesis Stream with one shard and batch size = 1 (for example) if I wanted the items to be consumed one by one with no concurrency.
If there are multiple consumers, increase the number of shards, one lambda is launched in parallel for each shard when there are items to process. If it makes sense, also increase the batch size.
But read again at the highlighted phrase below:
If however, you need to use multiple different consumers to do different things for the same data sent to the destination, the a Kinesis Stream is more appropriate.
If you have one or more producers and many consumers of the exactly same item, I guess you need to use SNS. The producer writes the item on one topic, then all the lambdas listening to the topic will process that item.
If this does not answer your question, please clarify it. There is a little ambiguity.

Related

Multi destinations for Kinesis stream at same time

Can we have multiple destinations from single Kinesis Streams?
I am getting output in Splunk but now I also want to add an S3 bucket as the destination.
If I add another Amazon Kinesis Data Firehose, will it affect the performance of Splunk reading? Splunk pulls directly from Kinesis. If I add another destination will it affect Will it affects our current read and writes?
One of the benefits of using Kinesis is that you can do exactly this behaviour.
Each consumer application becomes responsible for which events it has read from the shard. There is no concept of an entry being processed already between 2 seperate applications.
One recommendation from AWS to bare in mind for high throughput for multiple consumers is to use enhanced fanout.
Each consumer registered to use enhanced fan-out receives its own read throughput per shard, up to 2 MB/sec, independently of other consumers.

Perform AWS Lambda (with multiple data) only after a fixed amount of data is gathered

I'd like to execute a lambda function with multiple data, only after a fixed amount of data is gathered. The fixed amount would be, for example, to consider only a specific amount of messages, or messages that are sent in a specific temporal range.
I thought to solve this problem using an SQS, on which I write the messages, and using a polling to check the SQS status. But I don't like this solution, because I'd like to trigger the lambda instantly when the criteria is matched (for example: elapsed time from the first message sent, or a fixed amount of messages)
The ideal would be to send all the messages gathered, for example, after 1 minute after the first message arrives.
To be clear:
First message arrives in the queue
From now on starts a timer (e.g 1 min)
The timer ends and It will trigger the lambda with all the messages gathered till now
Moreover, I'd like to handle different queues in parallel, based on different ids
Is there an elegant way to do so?
I have already in place a system that works with sequential lambda, that handles all the process per single message
Unfortunately, it's not an easy task to do on AWS Lambda (we have a similar use case).
SQS or Kinesis data stream as a trigger can be helpful, but have several limitations:
SQS will be pulled by AWS Lambda in a very high frequency. You will have to add a concurrency limit to your lambda to make it get triggered by more than a single item. And the maximum batch size is just 10.
The base rate for Kinesis trigger is one per second for each shard, and cannot be changed.
Aggregating records between different invocations is not a good idea because you never know if the next invocation will start on a different container so they will get lost.
Kinesis Firehose can be helpful, as you can configure max batch size and max time range for sending a new batch. You can configure it to write to an S3 bucket and configure a lambda to be triggered by new created files.
Make sure that if you use a Kinesis data stream as the source of a Kinesis firehose, the data from each shard of the data stream is seperately batched in the Firehose (this is not documented in AWS).
You can do this in a few ways. I'd do it like this:
Have the queue be an event source for a lambda function
That lambda function can: trigger a state machine OR not do anything. It triggers the state machine if there isn't one currently triggered (meaning we're in that 1 minute range).
The state machine has the following steps:
1 minute wait
Does it's processing

Does AWS Lambda process DynamoDB stream events strictly in order?

I'm in the process of writing a Lambda function that processes items from a DynamoDB stream.
I thought part of the point behind Lambda was that if I have a large burst of events, it'll spin up enough instances to get through them concurrently, rather than feeding them sequentially through a single instance. As long as two events have a different key, I am fine with them being processed out of order.
However, I just read this page on Understanding Retry Behavior, which says:
For stream-based event sources (Amazon Kinesis Data Streams and DynamoDB streams), AWS Lambda polls your stream and invokes your Lambda function. Therefore, if a Lambda function fails, AWS Lambda attempts to process the erring batch of records until the time the data expires, which can be up to seven days for Amazon Kinesis Data Streams. The exception is treated as blocking, and AWS Lambda will not read any new records from the stream until the failed batch of records either expires or processed successfully. This ensures that AWS Lambda processes the stream events in order.
Does "AWS Lambda processes the stream events in order" mean Lambda cannot process multiple events concurrently? Is there any way to have it process events from distinct keys concurrently?
With AWS Lambda Supports Parallelization Factor for Kinesis and DynamoDB Event Sources, the order is still guaranteed for each partition key, but not necessarily within each shard when Concurrent batches per shard is set to be greater than 1. Therefore the accepted answer needs to be revised.
Stream records are organized into groups, or shards.
According to Lambda documentation, the concurrency is achieved on shard-level. Within each shard, the stream events are processed in order.
Stream-based event sources : for Lambda functions that process Kinesis
or DynamoDB streams the number of shards is the unit of concurrency.
If your stream has 100 active shards, there will be at most 100 Lambda
function invocations running concurrently. This is because Lambda
processes each shard’s events in sequence.
And according to Limits in DynamoDB,
Do not allow more than two processes to read from the same DynamoDB
Streams shard at the same time. Exceeding this limit can result in
request throttling.

Can I limit concurrent invocations of an AWS Lambda?

I have a Lambda function that’s triggered by a PUT to an S3 bucket.
I want to limit this Lambda function so that it’s only running one instance at a time – I don’t want two instances running concurrently.
I’ve had a look through the Lambda configuration and docs, but I can’t see anything obvious. I can about writing my own locking system, but it would be nice if this was already a solved problem.
How can I limit the number of concurrent invocations of a Lambda?
AWS Lambda now supports concurrency limits on individual functions:
https://aws.amazon.com/about-aws/whats-new/2017/11/set-concurrency-limits-on-individual-aws-lambda-functions/
I would suggest you to use Kinesis Streams (or alternatively DynamoDB + DynamoDB Streams, which essentially have the same behavior).
You can see Kinesis Streams as as queue. The good part is that you can use a Kinesis Stream as a Trigger to you Lambda function. So anything that gets inserted into this queue will automatically be passed over to your function, in order. So you will be able to process those S3 events one by one, one Lambda execution after the other (one instance at a time).
In order to do that, you'll need to create a Lambda function with the simple purpose of getting S3 Events and putting them into a Kinesis Stream. Then you'll configure that Kinesis Stream as your Lambda Trigger.
When you configure the Kinesis Stream as your Lambda Trigger I suggest you to use the following configuration:
Batch size: 1
This means that your Lambda will be called with only one event from Kinesis. You can select a higher number and you'll get a list of events of that size (for example, if you want to process the last 10 events in one Lambda execution instead of 10 consecutive Lambda executions).
Starting position: Trim horizon
This means it'll behave as a queue (FIFO)
A bit more info on AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AWS Lambda.
I hope this helps anyone with a similar problem.
P.S. Bear in mind that Kinesis Streams have their own pricing. Using DynamoDB + DynamoDB Streams might be cheaper (or even free due to the non-expiring Free Tier of DynamoDB).
No, this is one of the things I'd really like to see Lambda support, but currently it does not. One of the problems is that if there were a lot of S3 PUT operations happening AWS would have to queue up all the Lambda invocations somehow, and there is currently no support for that.
If you built a locking mechanism into your Lambda function, what would you do with the requests you don't process due to a lock? Would you just throw those S3 notifications away?
The solution most people recommend is to have S3 send the notifications to an SQS queue, and then have your Lambda function scheduled to run periodically, like once a minute, and check if there is an item in the queue that needs to be processed.
Alternatively, have S3 send the notifications to SQS and just have a t2.nano EC2 instance with a single-threaded service polling the queue.
I know this is an old thread, but I ran across it trying to figure out how to make sure my time sequenced SQS messages were processed in order coming out of a FIFO queue and not getting processed simultaneously/out-of-order via multiple Lambda threads running.
Per the documentation:
For FIFO queues, Lambda sends messages to your function in the order
that it receives them. When you send a message to a FIFO queue, you
specify a message group ID. Amazon SQS ensures that messages in the
same group are delivered to Lambda in order. Lambda sorts the messages
into groups and sends only one batch at a time for a group. If your
function returns an error, the function attempts all retries on the
affected messages before Lambda receives additional messages from the
same group.
Your function can scale in concurrency to the number of active message
groups.
Link: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
So essentially, as long as you use a FIFO queue and submit your messages that need to stay in sequence with the same MessageGroupID, SQS/Lambda automatically handles the sequencing without any additional settings necessary.
Have the S3 "Put events" cause a message to be placed on the queue (instead of involving a lambda function). The message should contain a reference to the S3 object. Then SCHEDULE a lambda to "SHORT POLL the entire queue".
PS: S3 events can not trigger a Kinesis Stream... only SQS, SMS, Lambda (see http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#supported-notification-destinations). Kinesis Stream are expensive and used for real-time event handling.

AWS Lambda Limits when processing Kinesis Stream

Can someone explain what happens to events when a Lambda is subscribed to Kinesis item create events. There is a limit of 100 concurrent requests for an account in AWS, so if 1,000,000 items are added to kinesis how are the events handled, are they queued up for the next available concurrent lambda?
From the FAQ http://aws.amazon.com/lambda/faqs/
"Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel."
What this means is if you have 1M items added to Kinesis, but only one shard, the throttle doesn't matter - you will only have one Lambda function instance reading off that shard in serial, based on the batch size you specified. The more shards you have, the more concurrent invocations your function will see. If you have a stream with > 100 shards, the account limit you mention can be easily increased to whatever you need it to be through AWS customer support. More details here. http://docs.aws.amazon.com/lambda/latest/dg/limits.html
hope that helps!