How to handle Kinesis Data Stream without EC2 - amazon-web-services

I want to handle my Kinesis streaming data without using an EC2 instance.
Is it possible to accomplish this, e.g. through Lambda functions?

Yes, you can use the Lambda service to process Kinesis streaming data. What you need to do is create a Lambda function to process the data (the data will be available through the event parameter, the first parameter of the handler function).
In the case of streaming data, your Lambda function isn't invoked as a response to some event. Instead, the Lambda service periodically polls Kinesis for available data and then invokes your function.
For this to happen, you need to create an event source mapping between your Lambda function and the Kinesis stream, where you can also specify the size of the batch that will be processed by Lambda and its starting position.
Don't forget to create a proper role for your Lambda function; it needs access to the Kinesis service, so you need something like the AWSLambdaKinesisExecutionRole managed policy.
Another thing to consider is the batch size and how complicated your processing algorithm is. Lambda can run only for a limited time (currently 15 minutes is the maximum you can specify), after which it is automatically terminated by AWS. In that case, you will need to use something other than Lambda or split your Lambda function into a few smaller ones.
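To illustrate, here is a minimal sketch of such a handler in Python, assuming JSON payloads on the stream (Kinesis record data arrives base64-encoded inside the event parameter):

    import base64
    import json

    def lambda_handler(event, context):
        # Lambda delivers a batch of Kinesis records in event["Records"].
        for record in event["Records"]:
            # Kinesis record data is base64-encoded by the service.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            print(f"Got record {record['kinesis']['sequenceNumber']}: {payload}")

The event source mapping itself can be created with boto3; the stream ARN and function name below are hypothetical:

    import boto3

    lambda_client = boto3.client("lambda")
    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
        FunctionName="my-kinesis-processor",
        BatchSize=100,              # records per invocation
        StartingPosition="LATEST",  # or "TRIM_HORIZON" to start from the oldest record
    )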

Related

Push data from external API on AWS Kinesis

I am new to the AWS ecosystem. I'm building a (near) real-time system where data comes from an external API. The API is updated every 10 seconds, so I would like to consume and populate my Kinesis pipeline as soon as new data appears.
However, I'm not sure which tool to use for that. I did some research and I think I have two options:
An AWS Lambda function triggered every 10 seconds that puts data on Kinesis
AWS Step Functions
What is the standard approach for a given use case?
An AWS Step Functions workflow is typically built from Lambda functions. That is, each step in a workflow is often a Lambda function, so you can think of a workflow created by AWS Step Functions as a chain of Lambda functions.
If you are not familiar with how to create a workflow see this AWS tutorial:
Create AWS serverless workflows by using the AWS SDK for Java
(You can create a Lambda function in any supported programming language; this one happens to use Java.)
Now, to answer your question: using a workflow to populate a Kinesis data stream is possible. You can build a Lambda function that gathers data (using the logic in your Lambda function) and then invokes the putRecord operation of Kinesis to populate the data stream. You can create a scheduled event that fires every x minutes based on a cron expression.
If you do use a cron expression, you can use the AWS Step Functions API to fire off the workflow. That is, create another Lambda function that is scheduled to fire, say, every 10 minutes. Then, in this Lambda function, use the Step Functions API to invoke the workflow. The workflow can then populate the Kinesis data stream with data.
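For the simpler Lambda-only option, a scheduled function along these lines would do; this is a sketch assuming a JSON API, and the API URL, stream name, and partition key field are all hypothetical:

    import json
    import urllib.request

    import boto3

    kinesis = boto3.client("kinesis")

    def lambda_handler(event, context):
        # Fetch the latest data from the external API.
        with urllib.request.urlopen("https://api.example.com/latest") as resp:
            data = json.loads(resp.read())

        # Push one record onto the Kinesis stream.
        kinesis.put_record(
            StreamName="my-stream",
            Data=json.dumps(data).encode("utf-8"),
            PartitionKey=str(data.get("id", "default")),
        )

An EventBridge (CloudWatch Events) schedule rule can then invoke this function at the desired rate; note that schedule rules have a minimum granularity of one minute, so a strict 10-second cadence would need the function itself to loop internally.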

Is it possible to trigger a specific lambda function from the AWS Kinesis stream?

I have multiple Lambda functions (CREATE/UPDATE/DELETE) subscribed to the Kinesis stream. Now I want to trigger only a specific Lambda function based on the data/event type.
Is it possible? If not what is the better architecture/way to handle this problem?
Sadly, you can't do this. There are generally two ways to work around it:
Connect your stream to a single Lambda function. That function will receive all records from the stream and dispatch them to the other functions. The dispatch can be direct, or through dedicated SQS queues, for example:
                    /-----> SQS 1 ---> CREATE lambda
Kinesis ---> Lambda ------> SQS 2 ---> UPDATE lambda
                    \-----> SQS 3 ---> DELETE lambda
Alternatively, use one function, but with the CREATE, UPDATE and DELETE code all inside that single function. In the Lambda handler you would use basic if-then-else conditions to invoke the different code paths for CREATE, UPDATE and DELETE, as in the sketch below.
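A minimal sketch of such a dispatcher in Python, assuming each record carries a JSON body with an eventType field (the field name and the handler functions are hypothetical):

    import base64
    import json

    def handle_create(data): print("CREATE", data)
    def handle_update(data): print("UPDATE", data)
    def handle_delete(data): print("DELETE", data)

    def lambda_handler(event, context):
        for record in event["Records"]:
            data = json.loads(base64.b64decode(record["kinesis"]["data"]))
            event_type = data.get("eventType")

            # Route each record to the matching code path.
            if event_type == "CREATE":
                handle_create(data)
            elif event_type == "UPDATE":
                handle_update(data)
            elif event_type == "DELETE":
                handle_delete(data)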

AWS lambda - best practice when reading from long list/s3

I have a scheduled error-handling Lambda, and I would like to use serverless technology here as opposed to a Spring Boot service or something similar.
The Lambda will read from an S3 bucket and process accordingly. The problem is that at times the S3 bucket may have a high volume of data to be processed, and long-running operations aren't suited to Lambdas.
One solution I can think of is to have the Lambda read and process one item from the bucket and, on success, trigger another instance of the same Lambda unless the bucket is empty/fully processed. The thing I don't like is that this is synchronous and quite slow. I also need to be conscious of running too many Lambdas at the same time, as we hit a REST endpoint as part of the error flow and don't want to overload it with too many requests.
I am thinking it would be nice to have maybe three instances of the Lambda running at the same time until the bucket is empty, but I'm not really sure. I am wondering if anyone has any nice patterns that could be used here or suggestions on best practices?
Thanks
Create an S3 bucket for processing your files.
Enable an S3 -> Lambda trigger: on every new file in the bucket, Lambda will be invoked to process that file, and every file is processed separately. See https://docs.aws.amazon.com/AmazonS3/latest/user-guide/enable-event-notifications.html
Once a file is processed, you can either delete it or move it elsewhere.
Regarding concurrency, please have a look at provisioned concurrency: https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
Update:
As you still plan to use a scheduler Lambda and S3:
The scheduler Lambda reads/lists only the file names and puts a message into SQS for each file to be processed.
A new Lambda consumes the SQS messages and processes the files.
Note: I would recommend using SQS from the start if the files/messages are not too big; it has built-in recovery mechanics (DLQ, delays, visibility timeouts, etc.) from which you benefit more than from plain S3 storage. The second way is to just create a message with a file reference and still use SQS.
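A sketch of the scheduler half of this pattern, with a hypothetical bucket name and queue URL:

    import json

    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    BUCKET = "my-error-bucket"
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/error-files"

    def lambda_handler(event, context):
        # List every object in the bucket and enqueue one SQS message per file.
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET):
            for obj in page.get("Contents", []):
                sqs.send_message(
                    QueueUrl=QUEUE_URL,
                    MessageBody=json.dumps({"bucket": BUCKET, "key": obj["Key"]}),
                )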
I'd separate the Lambda that is called by the scheduler from the Lambda that does the actual processing. When the scheduler calls the first Lambda, it can look at the contents of the bucket and then spawn the worker Lambdas to process the objects. This way you have control over how many objects you assign to each worker.
Given your requirements, I would recommend:
Configure an Amazon S3 Event so that a message is pushed to an Amazon SQS queue when the objects are created in the S3 bucket
Schedule an AWS Lambda function at regular intervals that will:
Check that the external service is working
Invoke a Lambda function to process one message from the queue, and keep looping
The hard part would be throttling the second Lambda function so that it doesn't try to send all requests at once (which might impact that external service).
You could probably do this by using a Step Function to trigger Lambda and then, if it was successful, trigger another Lambda function. This could even be done in parallel, such as allowing up to three parallel Lambda executions. The benefit of using Step Functions is that there is no cost for "waiting" for each Lambda to finish executing.
So, the Step Function flow would be something like:
Invoke a "check external service" Lambda function
If it fails, then quit the flow
Invoke the "processing" Lambda function
Get one message
Process the message
If successful, remove the message from the queue
Return success/fail
If it was successful, keep looping until the queue is empty
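A sketch of what that "processing" Lambda might look like; the queue URL and the process() function are hypothetical, and the returned status field is what the Step Function's choice states would branch on:

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"

    def process(body):
        # Placeholder for the real work against the external service.
        print("processing", body)

    def lambda_handler(event, context):
        # Pull a single message from the queue.
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
        messages = resp.get("Messages", [])
        if not messages:
            return {"status": "EMPTY"}  # tells the workflow to stop looping

        message = messages[0]
        try:
            process(message["Body"])
        except Exception:
            # Leave the message; the visibility timeout will re-expose it later.
            return {"status": "FAIL"}

        # Only remove the message once processing succeeded.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
        return {"status": "OK"}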

Bulk invoke AWS Lambda?

I need to call AWS Lambda hundreds of thousands of times, with different parameters. Is there a way to somehow execute it in bulk, passing a long list of parameters or e.g. a path to an S3 object with one line per payload?
You can pass all of the parameters through Amazon SQS. You can specify the batch size (i.e. the number of messages that are delivered to the function together as a group). The only issue is that the maximum number of messages sent at once is 10. Still, this should be more efficient than processing them one by one. Alternatively, you can encode multiple parameters within a single message; just keep in mind that the maximum message size is 256 KB.
A common technique is:
Create an Amazon SQS queue
Configure the queue to trigger your AWS Lambda function
Send messages to the queue. Each message would contain a set of input parameters that will be used by the Lambda function.
The format of the parameters is your choice because you will need to write code in the Lambda function to retrieve the parameters from the event record that is passed to the handler function.
You can configure a batch size which controls how many messages are sent to the Lambda function. You can set it to 1 to process a single message, or make it bigger and have the Lambda function loop through the messages that have been provided.
I recommend that you test the process with just a few messages before putting all the messages in the queue.
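For the sending side, something like this batching helper would work; the queue URL is hypothetical, and each parameter set is serialized as JSON:

    import json

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/lambda-params"

    def enqueue_parameter_sets(parameter_sets):
        # SQS accepts at most 10 messages per SendMessageBatch call.
        for i in range(0, len(parameter_sets), 10):
            batch = parameter_sets[i:i + 10]
            sqs.send_message_batch(
                QueueUrl=QUEUE_URL,
                Entries=[
                    {"Id": str(j), "MessageBody": json.dumps(params)}
                    for j, params in enumerate(batch)
                ],
            )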
See:
Using AWS Lambda with Amazon SQS - AWS Lambda
AWS Lambda function scaling - AWS Lambda
I need to call AWS Lambda hundreds of thousands of times, with different parameters.
For this you can use Amazon EventBridge, where you can schedule a Lambda at your convenience using a cron expression (for fine granularity). If you want to send events from different AWS sources or a third party, that can also be done using an event pattern.
After selecting AWS Lambda as the target, you can configure the input as constant JSON text, the matched event, part of the matched event, or an input transformer.
Note: EventBridge also lets you configure a retry policy and a dead-letter queue.
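As an illustrative sketch of the scheduling part with boto3 (the rule name, function ARN, and constant JSON input are hypothetical; granting EventBridge permission to invoke the function is omitted):

    import json

    import boto3

    events = boto3.client("events")

    # Create (or update) a scheduled rule that fires every 10 minutes.
    events.put_rule(
        Name="invoke-my-lambda",
        ScheduleExpression="rate(10 minutes)",
    )

    # Point the rule at the Lambda function with a constant JSON input.
    events.put_targets(
        Rule="invoke-my-lambda",
        Targets=[{
            "Id": "my-lambda-target",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:my-function",
            "Input": json.dumps({"param": "value"}),
        }],
    )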

Can I limit concurrent invocations of an AWS Lambda?

I have a Lambda function that’s triggered by a PUT to an S3 bucket.
I want to limit this Lambda function so that it’s only running one instance at a time – I don’t want two instances running concurrently.
I've had a look through the Lambda configuration and docs, but I can't see anything obvious. I could write my own locking system, but it would be nice if this was already a solved problem.
How can I limit the number of concurrent invocations of a Lambda?
AWS Lambda now supports concurrency limits on individual functions:
https://aws.amazon.com/about-aws/whats-new/2017/11/set-concurrency-limits-on-individual-aws-lambda-functions/
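That feature is the reserved concurrency setting. As an illustrative sketch, setting it to 1 on a hypothetical function via boto3:

    import boto3

    lambda_client = boto3.client("lambda")

    # Reserve a concurrency of 1 so at most one instance of the function
    # runs at any given time; extra invocations are throttled.
    lambda_client.put_function_concurrency(
        FunctionName="my-function",
        ReservedConcurrentExecutions=1,
    )

Note that throttled asynchronous invocations (such as S3 events) are retried by Lambda for a limited time rather than dropped immediately.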
I would suggest using Kinesis Streams (or alternatively DynamoDB + DynamoDB Streams, which have essentially the same behavior).
You can see a Kinesis Stream as a queue. The good part is that you can use a Kinesis Stream as a trigger for your Lambda function, so anything that gets inserted into this queue will automatically be passed over to your function, in order. That way you are able to process those S3 events one by one, one Lambda execution after another (one instance at a time).
In order to do that, you'll need to create a Lambda function with the simple purpose of getting the S3 events and putting them into a Kinesis Stream. Then you'll configure that Kinesis Stream as your Lambda trigger, as sketched below.
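A minimal sketch of that forwarding function, with a hypothetical stream name:

    import json

    import boto3

    kinesis = boto3.client("kinesis")

    def lambda_handler(event, context):
        # Forward each S3 event record into the Kinesis stream.
        for record in event["Records"]:
            kinesis.put_record(
                StreamName="s3-events",
                Data=json.dumps(record).encode("utf-8"),
                # A fixed partition key keeps every event on a single shard,
                # which preserves strict ordering (and one-at-a-time processing).
                PartitionKey="single-shard",
            )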
When you configure the Kinesis Stream as your Lambda trigger, I suggest the following configuration:
Batch size: 1
This means that your Lambda will be called with only one event from Kinesis at a time. You can select a higher number and you'll get a list of events of that size (for example, if you want to process the last 10 events in one Lambda execution instead of in 10 consecutive Lambda executions).
Starting position: Trim horizon
This means it'll behave as a queue (FIFO).
A bit more info on AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AWS Lambda.
I hope this helps anyone with a similar problem.
P.S. Bear in mind that Kinesis Streams have their own pricing. Using DynamoDB + DynamoDB Streams might be cheaper (or even free, thanks to DynamoDB's non-expiring Free Tier).
No, this is one of the things I'd really like to see Lambda support, but currently it does not. One of the problems is that if there were a lot of S3 PUT operations happening AWS would have to queue up all the Lambda invocations somehow, and there is currently no support for that.
If you built a locking mechanism into your Lambda function, what would you do with the requests you don't process due to a lock? Would you just throw those S3 notifications away?
The solution most people recommend is to have S3 send the notifications to an SQS queue, and then have your Lambda function scheduled to run periodically, like once a minute, and check if there is an item in the queue that needs to be processed.
Alternatively, have S3 send the notifications to SQS and just have a t2.nano EC2 instance with a single-threaded service polling the queue.
I know this is an old thread, but I ran across it while trying to figure out how to make sure my time-sequenced SQS messages were processed in order coming out of a FIFO queue, rather than being processed simultaneously/out of order by multiple Lambda threads.
Per the documentation:
For FIFO queues, Lambda sends messages to your function in the order that it receives them. When you send a message to a FIFO queue, you specify a message group ID. Amazon SQS ensures that messages in the same group are delivered to Lambda in order. Lambda sorts the messages into groups and sends only one batch at a time for a group. If your function returns an error, the function attempts all retries on the affected messages before Lambda receives additional messages from the same group.
Your function can scale in concurrency to the number of active message groups.
Link: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
So essentially, as long as you use a FIFO queue and submit the messages that need to stay in sequence with the same MessageGroupId, SQS/Lambda automatically handles the sequencing without any additional settings necessary. A sketch of the sending side follows.
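For illustration, here is how sending such messages might look; the queue URL and the message's id field are hypothetical:

    import json

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events.fifo"

    def send_ordered(message, group_id):
        # Messages sharing a MessageGroupId are delivered to Lambda strictly in order.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(message),
            MessageGroupId=group_id,
            # Required unless content-based deduplication is enabled on the queue.
            MessageDeduplicationId=str(message["id"]),
        )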
Have the S3 "Put events" cause a message to be placed on the queue (instead of involving a lambda function). The message should contain a reference to the S3 object. Then SCHEDULE a lambda to "SHORT POLL the entire queue".
PS: S3 events can not trigger a Kinesis Stream... only SQS, SMS, Lambda (see http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#supported-notification-destinations). Kinesis Stream are expensive and used for real-time event handling.