I am new to AWS ecosystem. I'm building a (near) real-time system, where data comes from external API. The API is updated every 10 seconds, so I would like to consume and populate my Kinesis pipeline as soon as new data appears.
However, I'm not sure which tool use for that. I did a small research and, I think, I have two options:
AWS lambda which is triggered every 10 seconds and puts data on Kinesis
AWS StepFunction
What is the standard approach for a given use case?
AWS Step functions is created by Lambda functions. That is, each step in a workflow is actually a Lambda function. You can think of a workflow created by AWS Step Functions as a chain of Lambda functions.
If you are not familiar with how to create a workflow see this AWS tutorial:
Create AWS serverless workflows by using the AWS SDK for Java
(you can create a Lambda function in any supported programming language. This one happens to use Java).
Now, to answer your question, using a workflow to populate a Kinesis data stream is possible. You can build a Lambda function that gathers data (using logic in your Lambda function), and then invoke the putRecord operation of Kinesis to populate the data stream. You can create a scheduled event that fires off every x min based on a CRON expression.
If you do use a CRON expression, you can use the AWS Step Functions API to fire off the workflow. That is, create another Lambda function that is scheduled to fire say every 10 mins. Then in this Lambda funciton, use the Step Functions API to invoke the workflow. Now the workflow can populate the Kinesis data stream with data.
Related
I am new to AWS world and I am trying to implement a process where data written into S3 by AWS EMR can be loaded into AWS Redshift. I am using terraform to create S3 and Redshift and other supported functionality. For loading data I am using lambda function which gets triggered when the redshift cluster is up . The lambda function has the code to copy the data from S3 to redshift. Currently the process seams to work fine .The amount of data is currently low
My question is
This approach seems to work right now but I don't know how it will work once the volume of data increases and what if lambda functions times out
can someone please suggest me any alternate way of handling this scenario even if it can be handled without lambda .One alternate I came across searching for this topic is AWS data pipeline.
Thank you
A server-less approach I've recommended clients move to in this case is Redshift Data API (and Step Functions if needed). With the Redshift Data API you can launch a SQL command (COPY) and close your Lambda function. The COPY command will run to completion and if this is all you need to do then your done.
If you need to take additional actions after the COPY then you need a polling Lambda that checks to see when the COPY completes. This is enabled by Redshift Data API. Once COPY completes you can start another Lambda to run the additional actions. All these Lambdas and their interactions are orchestrated by a Step Function that:
launches the first Lambda (initiates the COPY)
has a wait loop that calls the "status checker" Lambda every 30 sec (or whatever interval you want) and keeps looping until the checker says that the COPY completed successfully
Once the status checker lambda says the COPY is complete the step function launches the additional actions Lambda
The Step function is an action sequencer and the Lambdas are the actions. There are a number of frameworks that can set up the Lambdas and Step Function as one unit.
With bigger datasets, as you already know, Lambda may time out. But 15 minutes is still a lot of time, so you can implement alternative solution meanwhile.
I wouldn't recommend data pipeline as it might be an overhead (It will start an EC2 instance to run your commands). Your problem is simply time out, so you may use either ECS Fargate, or Glue Python Shell Job. Either of them can be triggered by Cloudwatch Event triggered on an S3 event.
a. Using ECS Fargate, you'll have to take care of docker image and setup ECS infrastructure i.e. Task Definition, Cluster (simple for Fargate).
b. Using Glue Python Shell job you'll simply have to deploy your python script in S3 (along with the required packages as wheel files), and link those files in the job configuration.
Both of these options are serverless and you may chose one based on ease of deployment and your comfort level with docker.
ECS doesn't have any timeout limits, while timeout limit for Glue is 2 days.
Note: To trigger AWS Glue job from Cloudwatch Event, you'll have to use a Lambda function, as Cloudwatch Event doesn't support Glue start job yet.
Reference: https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutTargets.html
I'd like to peform the following tasks on a regular basis (e.g. every day at 6AM) using AWS:
get new set of data using API. This dataset is updated on a daily basis.
run a python script that would process the obtained dataset by the means of several python libraries like matplotlib, pandas, plotly
automatically send the output of the script, which would be a single pdf file or a html dashboard, via email to a group of specified recipients
I know how to perform all of the above items locally - my goal is to automate this routine. I'm new to AWS and would appreciate some advice on how to perform these tasks in a straightforward way. Based on the reading I did so far, it looks like the serverless approach may be able to do the job and also reduce the complexity, but I'm not sure which functionalities exactly I should use.
For scheduling you can use aws event bridge.
You can schedule AWS lambda or AWS Step Functions both of these are serverless :).
You can have 3 lambdas
To get the data and save it in S3/dynamo (if you want to persist the data)
Processor lambda and save the report to S3.
Another lambda to send email using AWS SES which will read the report from S3 and send it.
If you don't want to use step function you can start your lambda from S3 put event or you can trigger one lambda from another lambda using aws-sdk.
So there are different approaches you can take.
First off, I would create a Lambda. You can schedule the function to run on a cron job.
If the Message you want to send is small:
I would create a SNS Topic with a email fan out.
Inside your lambda you can then transform the data and send out via SNS.
Otherwise:
I would use SES and send a mail via the SES SDK.
I have an AWS Python lambda function that connects to a DB, checks data integrity and send alerts to a slack channel(that's already done).
I want to execute that lambda every XX minutes.
What's the best way to do it?
You can build this with AWS EventBridge.
The documentation contains an example for this exact use case:
Tutorial: Schedule AWS Lambda Functions Using EventBridge
Quick question: Is it possible to trigger the execution of a Step Function after an SQS message was sent?, if so, how would you specify it into the cloudformation yaml file?
Thanks in advance.
The first think to consider is this: do you really need to use SQS to start a Step Functions state machine? Can you use API gateway instead? Or could you write your messages to a S3 bucket and use the CloudWatch events to start a state machine?
If you must use SQS, then you will need to have a lambda function to act as a proxy. You will need to set up the queue as a lambda trigger, and you will need to write a lambda that can parse the SQS message and make the appropriate call to the Step Functions StartExecution API.
I’m on mobile, so I can’t type up the yaml right now, but if you need it, I can try to update with it later. For now, here is detailed walkthrough of how to invoke a Step Functions state machine from Lambda (including example yaml), and here is walkthrough of how to use CloudFormation to set up SQS to trigger a Lambda.
EventBridge Pipes (launched at re:Invent 2022) allows you to trigger Step Functions State Machines without need for a Lambda function.
https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-pipes.html
You can find an example here:
https://github.com/aws-samples/aws-stepfunctions-examples/blob/main/sam/demo-trigger-stepfunctions-from-sqs/template.yaml
i have an aws lambda function to do some statistics on over 1k of stock tickers after market close. i have an option like below.
setup a cron job in ec2 instance and trigger a cron job to submit 1k http request asyn (e.g. http://xxxxx.lambdafunction.xxxx?ticker= to trigger the aws lambda function (or submit 1k request to SNS and let lambda to pickup.
i think it should run fine, but much appreciate if there is any serverless/PaaS approach to trigger task
On top of my head, Here are a couple of ways to achieve what you need:
Option 1: [Cost-Effective]
Post all the ticks to AWS FIFO SQS queue.
Define triggers on this queue to invoke lambda function.
Result: Since you are posting all the events in FIFO queue that maintains the order, all the events will be polled sequentially. More-over SQS to lambda trigger will help you scale automatically based on the number of message in the queue.
Option 2: [Costly and can easily scale for real-time processing]
Same as above, but instead of posting to FIFO queue, post to Kinesis Stream.
Enable Kinesis stream to trigger lambda function.
Result: Kinesis will ensure the order of event arriving in the stream and lambda function invocation will be invoked based on the number of shards in the stream. This implementation scales significantly. If you have any future use-case for real-time processing of tickers, this could be a great solution.
Option 3: [Cost Effective, alternate to Option:1]
Collect all ticker events(1k or whatever) and put it into a file.
Upload this file to AWS S3 bucket.
Enable S3 event notification to trigger proxy lambda function.
This proxy lambda function reads the s3 file and based on the total number of events in the file, it will spawn n parallel actor lambda function.
Actor lambda function will process each event.
Result: Easy to implement, cost-effective and provides easy scaling based on your custom algorithm to distribute the load in the proxy lambda function.
Option 4: [All-serverless]
Write a lambda function that gets the list of tickers from some web-server.
Define an AWS cloud watch rule for generating events based on cron/frequency.
Add a trigger to this cloudwatch rule to invoke proxy lambda function.
Proxy lambda function will use any combination of above options[1, 2 or 3] to trigger the actor lambda function for processing the records.
Result: Everything can be configured via AWS console and easy to use. Alternatively, you can also write your AWS cloud formation template to generate all the required resources in a single go.
Having said that, now I will leave this up to you to choose the right solution based on your business/cost requirements.
You can use lambda fanout option.
You can follow these steps to process 1k or more using serverless aproach.
1.Store all the stock tickers in a S3 file.
2.Create a master lambda which will read the s3 file and split the stocks in groups of 10.
3. Create a child lambda which will make the async call to external http service and fetch the details.
4. In the master lambda Loop through these groups and invoke 100 child lambdas passing in each group and return the results to the
Master lambda
5. Collect all the information returned from the child lambdas and continue with your processing here.
Now you can trigger this master lambda at the end of markets everyday using CloudWatch time based rule scheduler.
This is a complete serverless approach.