Running AWS Glue jobs in parallel - amazon-web-services

I have 30 Glue jobs that I want to run in parallel. If one job fails, the others must continue. I started with Step Functions, creating a state machine that executes a runner Lambda function, which in turn triggers a Glue job depending on a parameter (the name of the Glue job). For a single job there is already a decent amount of Step Functions logic implemented (retries, error handling, etc.).
Is there any way to execute a state machine from another state machine? That way I could have 30 parallel tasks that each execute another state machine. If you have any suggestions, please feel free to share.

AWS recommends using SNS for a fan out architecture to run parallel jobs from a single S3 event, as you get an overlap error if two lambdas try to use the same S3 event.
You basically send the S3 event to SNS and subscribe your 30 lambdas so they all trigger from the SNS notification (containing details of the S3 event) when it's published.
Create the Topic
Update the Topic Policy to allow Event Notifications from an S3 Bucket
Configure the S3 Bucket to send Event Notifications to the SNS Topic
Create the parallel Lambda functions, one for each job
Modify the Lambda functions to process SNS messages of S3 event notifications instead of the S3 event itself
https://aws.amazon.com/blogs/compute/fanout-s3-event-notifications-to-multiple-endpoints/
There is also another nice example with CloudFormation template https://aws.amazon.com/blogs/compute/messaging-fanout-pattern-for-serverless-architectures-using-amazon-sns/
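The last step above (processing the SNS message instead of the raw S3 event) could look roughly like the following. This is a minimal sketch, not code from the linked posts: the function name s3_objects_from_sns is made up for illustration, and the handler only prints where the real job would be started.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sns(event):
    """Return (bucket, key) pairs from an SNS-delivered S3 event notification."""
    objects = []
    for sns_record in event.get("Records", []):
        # The original S3 event arrives as a JSON string in Sns.Message.
        s3_event = json.loads(sns_record["Sns"]["Message"])
        # S3 test notifications carry no "Records" key, hence .get().
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys in S3 notifications are URL-encoded.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    objects = s3_objects_from_sns(event)
    for bucket, key in objects:
        # Placeholder: start this Lambda's own job here.
        print(f"would start the job for s3://{bucket}/{key}")
    return {"processed": len(objects)}
```

Each of the subscribed Lambdas can share this unwrapping code and differ only in the job it starts.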

Related

How to keep Lambda from triggering multiple times?

Tech stack: Salesforce data -> AWS AppFlow -> S3 -> Databricks job
Hello! I have an AppFlow flow that grabs Salesforce data and uploads it to S3 as a folder containing multiple Parquet files. I have a Lambda that listens on the prefix where this folder is dropped. This Lambda then triggers a Databricks job, which is an ingestion process I have created.
My main issue is that when these files are uploaded to S3, my Lambda is triggered once per uploaded file. How can I have the Lambda run just once?
Amazon AppFlow publishes a flow notification when a flow is complete:
Amazon AppFlow is integrated with Amazon CloudWatch Events to publish events related to the status of a flow. The following flow events are published to your default event bus.
AppFlow End Flow Run Report: This event is published when a flow run is complete.
You could trigger the Lambda function when this Event is published. That way, it is only triggered when the Flow is complete.
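A rough sketch of that setup follows. The event pattern would go on an EventBridge rule targeting the Lambda. The flow name "salesforce-to-s3" is hypothetical, and the detail field names ("flow-name", "status") and the status string "Execution Successful" are assumptions based on the sample event in the AppFlow documentation, so check them against a real event before relying on them.

```python
# Event pattern for an EventBridge rule on the default event bus.
APPFLOW_END_PATTERN = {
    "source": ["aws.appflow"],
    "detail-type": ["AppFlow End Flow Run Report"],
}

# Hypothetical flow name, for illustration only.
FLOW_NAME = "salesforce-to-s3"


def is_successful_run(event, flow_name=FLOW_NAME):
    """True if the event reports a successful run of the given flow."""
    detail = event.get("detail", {})
    return (
        event.get("detail-type") == "AppFlow End Flow Run Report"
        and detail.get("flow-name") == flow_name
        # Assumed status string; verify against a real event.
        and detail.get("status") == "Execution Successful"
    )


def lambda_handler(event, context):
    if not is_successful_run(event):
        return {"triggered": False}
    # Placeholder: trigger the Databricks job once per flow run here.
    return {"triggered": True}
```

Because the rule fires once per flow run rather than once per file, the Databricks job is started a single time regardless of how many Parquet files the flow wrote.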
I hope I've understood your issue correctly, but it sounds like your Lambda is working as designed: if it is set up to run every time a file is dropped into the S3 bucket, the S3 trigger will invoke the Lambda on every upload.
If you want to reduce the number of times your Lambda runs, you could set up an EventBridge rule with a cron schedule that invokes the Lambda at a defined interval to check the bucket for new files. You could then send all the files to your Databricks job in bulk rather than individually.

aws transcribe callback function

I want to call Amazon Transcribe from an AWS Lambda function.
In that Lambda handler, I want to start the transcription job but not wait for it to finish in a while loop, since that would not be cost-efficient. I don't see any way for the finished transcription job to call another Lambda, or something like that, to store the transcription information in an S3 bucket, for example.
Any idea how to solve this?
See Using Amazon EventBridge with Amazon Transcribe.
With Amazon EventBridge, you can respond to state changes in your Amazon Transcribe jobs by initiating events in other AWS services. When a transcription job changes state, EventBridge automatically sends an event to an event stream. You create rules that define the events that you want to monitor in the event stream and the action that EventBridge should take when those events occur. For example, routing the event to another service (or target), which can then take an action. You could, for example, configure a rule to route an event to an AWS Lambda function when a transcription job has completed successfully.
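As a sketch of the rule and target described above: the pattern below matches completed jobs, and the handler pulls the job name out of the event. The helper name completed_job_name is invented for this example, and what the handler does with the job name is left as a placeholder.

```python
# Event pattern for the EventBridge rule: only completed jobs are routed.
TRANSCRIBE_COMPLETED_PATTERN = {
    "source": ["aws.transcribe"],
    "detail-type": ["Transcribe Job State Change"],
    "detail": {"TranscriptionJobStatus": ["COMPLETED"]},
}


def completed_job_name(event):
    """Return the job name if the event is a COMPLETED state change, else None."""
    detail = event.get("detail", {})
    if event.get("detail-type") != "Transcribe Job State Change":
        return None
    if detail.get("TranscriptionJobStatus") != "COMPLETED":
        return None
    return detail.get("TranscriptionJobName")


def lambda_handler(event, context):
    job_name = completed_job_name(event)
    if job_name is None:
        return {"handled": False}
    # Placeholder: fetch the transcript for job_name and store it.
    return {"handled": True, "job": job_name}
```

The Lambda that starts the job can return immediately; this second Lambda only runs when the state change arrives.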
Another alternative is:
when you call StartTranscriptionJob, you supply an S3 bucket name and S3 object key that will receive the transcribed results
you can use the Amazon S3 Event Notifications feature to notify you or to automatically trigger a Lambda function
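A minimal sketch of the first step, assuming a boto3 Transcribe client is passed in (for example boto3.client("transcribe")). The bucket, key, and job names are hypothetical; the request fields are the StartTranscriptionJob parameters for the media location and output location.

```python
def build_job_request(job_name, media_uri, output_bucket, output_key,
                      language_code="en-US"):
    """Build StartTranscriptionJob parameters with an explicit output location."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "LanguageCode": language_code,
        # The transcript JSON is written to s3://output_bucket/output_key.
        "OutputBucketName": output_bucket,
        "OutputKey": output_key,
    }


def start_job(transcribe_client, job_name, media_uri, output_bucket, output_key):
    """Start the job and return immediately, without polling for completion."""
    request = build_job_request(job_name, media_uri, output_bucket, output_key)
    transcribe_client.start_transcription_job(**request)
    return request
```

An S3 event notification on the output prefix (here, whatever prefix output_key falls under) can then invoke the follow-up Lambda once the transcript object is written.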

How do I trigger a AWS lambda function only if bulk upload finished on S3?

We have the simple ETL setup below:
The vendor uploads crawled Parquet data to our S3 bucket.
The S3 event triggers a Lambda function, which triggers a Glue crawler to update the existing table partition in Glue.
This works fine most of the time, but in some cases our vendor might upload several files in quick succession, for example when refreshing historical data. This causes an issue, since a Glue crawler cannot run concurrently and the job will fail.
I'm wondering if there is anything we can do to avoid the potential error. I've looked into SQS but am not sure whether it can help. Below is what I would like to achieve:
The vendor uploads a file to S3.
S3 sends the event to SQS.
SQS holds the event and waits until there has been no further event for a given time period, say 5 minutes.
After no further event in 5 minutes, SQS triggers the Lambda function to run the Glue crawler.
Is this doable with S3 and SQS?
"SQS hold the event,"
Yes, you can do this, as you can set an SQS delay of up to 15 minutes.
"wait until there has been no other following event for a given time period, say 5 minutes."
No, there is no automated way to do that. You have to develop your own custom solution. The simplest way would be to not couple SQS with Lambda, and instead have a Lambda running on a schedule (e.g. every 5 minutes). The Lambda would need logic to determine whether no new files have been uploaded for some time, and then trigger your Glue job. This would probably involve DynamoDB to keep track of the last uploaded files between Lambda executions.
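The decision logic of such a scheduled Lambda might be sketched as follows. It assumes the timestamps of the last upload and the last crawl are stored somewhere between runs (DynamoDB, as suggested above) and passed in; how they are stored is not shown. The function names are hypothetical, and glue_client stands for a boto3 Glue client whose start_crawler call is used because the question is about a crawler.

```python
from datetime import datetime, timedelta, timezone

QUIET_PERIOD = timedelta(minutes=5)


def should_start_crawler(last_upload, last_crawl, now, quiet_period=QUIET_PERIOD):
    """Decide whether uploads have been quiet long enough to crawl."""
    if last_upload is None:
        return False  # nothing has been uploaded yet
    if last_crawl is not None and last_crawl >= last_upload:
        return False  # the latest upload has already been crawled
    return now - last_upload >= quiet_period


def run_if_quiet(glue_client, crawler_name, last_upload, last_crawl, now=None):
    """Start the crawler only after the quiet period has elapsed."""
    now = now or datetime.now(timezone.utc)
    if not should_start_crawler(last_upload, last_crawl, now):
        return False
    glue_client.start_crawler(Name=crawler_name)
    return True
```

With a 5-minute schedule and a 5-minute quiet period, the crawler starts between 5 and 10 minutes after the last file of a burst arrives.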

how should i architect aws lambda to support parallel process in batch model?

I have an AWS Lambda function that computes some statistics on over 1k stock tickers after market close. I have an option like the one below.
Set up a cron job on an EC2 instance that submits 1k HTTP requests asynchronously (e.g. http://xxxxx.lambdafunction.xxxx?ticker=) to trigger the AWS Lambda function (or submit 1k requests to SNS and let Lambda pick them up).
I think it should run fine, but I would much appreciate any serverless/PaaS approach to triggering the task.
Off the top of my head, here are a couple of ways to achieve what you need:
Option 1: [Cost-Effective]
Post all the tickers to an AWS SQS FIFO queue.
Define a trigger on this queue to invoke the Lambda function.
Result: Since you are posting all the events to a FIFO queue, which maintains order, the events will be polled sequentially. Moreover, the SQS-to-Lambda trigger will help you scale automatically based on the number of messages in the queue.
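Posting the tickers for Option 1 could be sketched like this, using SendMessageBatch (at most 10 entries per call). It assumes sqs_client is a boto3 SQS client and queue_url points at a FIFO queue. The single message group "tickers" and the use of the ticker itself as the deduplication ID are illustrative assumptions that only hold if each ticker is posted once per run.

```python
def batch_entries(tickers, group_id="tickers", batch_size=10):
    """Yield SendMessageBatch entry lists of at most batch_size messages."""
    for start in range(0, len(tickers), batch_size):
        chunk = tickers[start:start + batch_size]
        yield [
            {
                "Id": str(start + offset),  # must be unique within a batch
                "MessageBody": ticker,
                # FIFO queues require a message group ID.
                "MessageGroupId": group_id,
                # Assumes each ticker is sent once per run.
                "MessageDeduplicationId": ticker,
            }
            for offset, ticker in enumerate(chunk)
        ]


def enqueue(sqs_client, queue_url, tickers):
    """Send all tickers and return how many the queue accepted."""
    sent = 0
    for entries in batch_entries(tickers):
        response = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=entries)
        sent += len(response.get("Successful", []))
    return sent
```

Messages that share a message group ID are processed in order, so if ordering across tickers does not matter, passing a different group_id per ticker (or using a standard queue) should allow more parallel Lambda invocations.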
Option 2: [Costly and can easily scale for real-time processing]
Same as above, but instead of posting to FIFO queue, post to Kinesis Stream.
Enable Kinesis stream to trigger lambda function.
Result: Kinesis will preserve the order of events arriving in the stream, and the number of concurrent Lambda invocations will depend on the number of shards in the stream. This implementation scales significantly. If you have any future use case for real-time processing of tickers, this could be a great solution.
Option 3: [Cost-Effective, alternative to Option 1]
Collect all ticker events (1k or however many) and put them into a file.
Upload this file to an AWS S3 bucket.
Enable an S3 event notification to trigger a proxy Lambda function.
This proxy Lambda function reads the S3 file and, based on the total number of events in the file, spawns n parallel actor Lambda functions.
Each actor Lambda function processes one event.
Result: Easy to implement, cost-effective, and easy to scale based on your custom algorithm for distributing the load in the proxy Lambda function.
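The proxy function in Option 3 might be sketched as below. It assumes one ticker per line in the file, a boto3 Lambda client passed in as lambda_client, and a hypothetical actor function name. The asynchronous invocation type "Event" returns as soon as the request is queued, so the proxy does not wait for the actors.

```python
import json


def parse_tickers(file_text):
    """Assumes the uploaded file holds one ticker per line."""
    return [line.strip() for line in file_text.splitlines() if line.strip()]


def fan_out(lambda_client, actor_function, tickers):
    """Invoke the actor Lambda asynchronously, once per ticker."""
    for ticker in tickers:
        lambda_client.invoke(
            FunctionName=actor_function,
            InvocationType="Event",  # fire and forget
            Payload=json.dumps({"ticker": ticker}).encode("utf-8"),
        )
    return len(tickers)
```

Reading the file from S3 (for example with the S3 client's get_object) is left out here; the proxy would pass the decoded body to parse_tickers.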
Option 4: [All-serverless]
Write a lambda function that gets the list of tickers from some web-server.
Define an AWS CloudWatch rule that generates events based on a cron expression or rate.
Add a target to this CloudWatch rule to invoke the proxy Lambda function.
The proxy Lambda function will use any combination of the above options (1, 2 or 3) to trigger the actor Lambda function for processing the records.
Result: Everything can be configured via the AWS console and is easy to use. Alternatively, you can write an AWS CloudFormation template to create all the required resources in a single go.
Having said that, I will leave it up to you to choose the right solution based on your business and cost requirements.
You can use the Lambda fan-out option.
You can follow these steps to process 1k or more tickers using a serverless approach.
1. Store all the stock tickers in an S3 file.
2. Create a master Lambda which will read the S3 file and split the stocks into groups of 10.
3. Create a child Lambda which will make the async call to the external HTTP service and fetch the details.
4. In the master Lambda, loop through these groups and invoke 100 child Lambdas, passing in one group each, and return the results to the master Lambda.
5. Collect all the information returned from the child Lambdas and continue with your processing here.
Now you can trigger this master Lambda at the end of the market day, every day, using a CloudWatch time-based rule.
This is a completely serverless approach.
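Steps 2, 4 and 5 might be sketched as follows. Unlike a fire-and-forget fan-out, the master here invokes the children synchronously ("RequestResponse") from a thread pool so that it can collect their results. The child function name, the payload shape {"tickers": [...]}, and the assumption that each child returns JSON are all hypothetical, and lambda_client stands for a boto3 Lambda client.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size=10):
    """Split the ticker list into groups of `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def invoke_child(lambda_client, child_function, group):
    """Synchronously invoke one child Lambda and decode its JSON result."""
    response = lambda_client.invoke(
        FunctionName=child_function,
        InvocationType="RequestResponse",
        Payload=json.dumps({"tickers": group}).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())


def run_master(lambda_client, child_function, tickers, max_workers=20):
    """Invoke one child per group in parallel; results keep group order."""
    groups = chunk(tickers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda g: invoke_child(lambda_client, child_function, g), groups)
        )
```

The master's timeout has to cover the slowest child, since it blocks until every synchronous invocation has returned.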

I want to use amazon SQS to save the messages and use lambda to read the queue data and dump it into mysql

I am working with PHP.
I have a program that writes messages to Amazon SQS.
Can anybody tell me how I can use Lambda to get data from SQS and push it into MySQL? The Lambda function should be triggered whenever a new record is added to the queue.
Can somebody share the steps or code that will help me get through this task?
There isn't any official way to link SQS and Lambda at the moment. Have you looked into using an SNS topic instead of an SQS queue?
Agree with Mark B.
Ways to get events over to Lambda:
Use SNS: http://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html
Use SNS -> SQS: have the Lambda launched by the SNS notification, and just use it to load whatever is in the SQS queue.
Use Kinesis.
Alternatively, have the Lambda run by a cron job to read SQS. This depends on the latency you need. If you require messages to be processed immediately, then this is not the solution, because you would be running the Lambda all the time.
Important note for using SQS: you are charged when you query, even if no messages are waiting. So do not do fast polls, even in your Lambdas. It is easy to run up a huge bill doing nothing. This is also a good reason to make sure you set up CloudWatch on the account to monitor usage and charges.
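The cron-driven variant could be sketched roughly as below. It uses long polling (WaitTimeSeconds=20, the maximum) so that an empty queue costs one request per 20 seconds rather than a tight loop. The function name drain_queue is made up, sqs_client stands for a boto3 SQS client, and store is a hypothetical callable that would perform the MySQL insert with whatever driver is available; the database code itself is not shown.

```python
def drain_queue(sqs_client, queue_url, store, max_batches=10):
    """Read up to max_batches batches from the queue and store each message body."""
    stored = 0
    for _ in range(max_batches):
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # the per-call maximum
            WaitTimeSeconds=20,      # long poll instead of fast polling
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # queue is empty, stop polling until the next scheduled run
        for message in messages:
            store(message["Body"])
            # Delete only after the body has been stored successfully.
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
            stored += 1
    return stored
```

If store raises, the message is not deleted and becomes visible again after its visibility timeout, so the insert should tolerate being retried.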