Event-Driven ETL using AWS Lambda, Glue, EventBridge/CloudWatch - amazon-web-services

I have a Lambda that dynamically triggers 3-5 Glue jobs depending on input provided by a configuration file. Now I have to trigger a second Lambda only when all of the Glue jobs triggered by the first Lambda have succeeded; otherwise it should keep waiting, and if any of them fails, the entire process should fail and send a notification.

Related

Execute only one Glue job at a time / sequential Glue job execution

Currently, we have the following AWS setup for executing Glue jobs: an S3 event triggers a Lambda function execution whose Python logic triggers 10 AWS Glue jobs.
S3 -> Trigger -> Lambda -> 1 or more Glue Jobs.
With this setup, we see that multiple different Glue jobs run in parallel. How can I make it so that, at any point in time, only one Glue job runs, and any Glue jobs sent for execution wait in a queue until the currently running Glue job is finished?
You can use Step Functions and in each step specify the job you want to run, so you control the ordering: once the step-one job completes, call the step-two job, and so on.
If you are looking to have a job queue so that the Glue jobs trigger in sequence, you may consider a combination of SQS -> Lambda -> Glue jobs; please refer to this SO answer for details.
AWS Step Functions is another option, as suggested by Vaquar Khan.
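To illustrate the SQS -> Lambda -> Glue idea, here is a minimal sketch of a Lambda handler that only starts a Glue job when none of the managed jobs has an active run, and pushes the message back onto the queue otherwise. The job names, the assumption that the message body carries the job name, and the batch-size-1/reserved-concurrency-1 setup are all illustrative, not taken from the original answer:

```python
import boto3

glue = boto3.client("glue")

# Illustrative list of the Glue jobs that must never run concurrently.
JOB_NAMES = ["glue-job-1", "glue-job-2", "glue-job-3"]
ACTIVE_STATES = {"STARTING", "RUNNING", "STOPPING", "WAITING"}


def _any_job_active() -> bool:
    """Return True if any of the managed Glue jobs currently has an active run."""
    for name in JOB_NAMES:
        runs = glue.get_job_runs(JobName=name, MaxResults=5)["JobRuns"]
        if any(run["JobRunState"] in ACTIVE_STATES for run in runs):
            return True
    return False


def handler(event, context):
    # SQS-triggered Lambda; batch size 1 and reserved concurrency 1 are assumed,
    # so only one message is processed at a time.
    for record in event["Records"]:
        job_name = record["body"].strip()  # assume the message body is the Glue job name
        if _any_job_active():
            # Raising sends the message back to the queue after the visibility
            # timeout, so it is retried once the running job has finished.
            raise RuntimeError("A Glue job is still running; retrying this message later")
        response = glue.start_job_run(JobName=job_name)
        print(f"Started {job_name}: {response['JobRunId']}")
```

Raising instead of swallowing the message relies on the SQS visibility timeout to retry later, which is what gives the queue-like, one-at-a-time behaviour.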

Polling SQS from Step Functions

I have two AWS Glue jobs in my application.
The first Glue job produces some data that is passed to a Lambda, and Lambda errors go into a DLQ (SQS).
If there are no error messages in the DLQ, I want to run the second Glue job.
I want to automate this logic with Step Functions. So the steps of the state machine will be:
run the first Glue job.
poll the DLQ.
if the DLQ has had no messages for some time, run the second Glue job.
Is there any way to poll SQS for messages from a Step Function?
Answering the main question: SFN cannot poll SQS directly. If you really need this kind of integration, SFN can invoke a Lambda to poll a queue and return the result back to your SFN.
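As a minimal sketch of that Lambda-based check, assuming the DLQ URL is passed in the Step Functions task input (the input key dlq_url and the returned fields are my own illustrative choices):

```python
import boto3

sqs = boto3.client("sqs")


def handler(event, context):
    """Check whether the DLQ is (approximately) empty and report back to Step Functions."""
    queue_url = event["dlq_url"]  # illustrative input key supplied by the state machine
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]

    visible = int(attrs["ApproximateNumberOfMessages"])
    in_flight = int(attrs["ApproximateNumberOfMessagesNotVisible"])
    return {"empty": visible + in_flight == 0, "visible": visible, "in_flight": in_flight}
```

A Choice state after this task can branch on $.empty to either start the second Glue job or loop back through a Wait state.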
[Edit: The recently introduced Step Function Service Integrations now allow calling the SQS SDK ReceiveMessage and DeleteMessage APIs on a Queue as a Step Function task. Although SQS polling is now available in SFN, Queues are arguably not the best fit for the scenario in the OP].
More generally, it's not a typical pattern to use queues as part of a StepFunction architecture. SFN orchestration is really an alternative to queue-based orchestration. SFN's core function is to orchestrate tasks and it has its own try-catch-error handling logic. Not that it can't be done with queues, just not sure what the benefit is.
A typical SFN integration pattern starts with an EventBridge cron rule that triggers an SFN execution. Glue Job 1 is the first task: SFN calls Glue's StartJobRun as a sync-type task and waits for the result. Chain together multiple Lambda and Glue stages, add error branches to handle failure states, and end the workflow with a notification task.
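To make that pattern concrete, here is a minimal sketch of such a state machine definition created with boto3. The glue:startJobRun.sync and sns:publish service integrations are real, but the job names, state names, SNS topic, and role ARN are illustrative:

```python
import json
import boto3

# Illustrative names/ARNs; replace with your own.
GLUE_JOB_1 = "etl-job-1"
GLUE_JOB_2 = "etl-job-2"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:etl-notifications"

definition = {
    "StartAt": "RunGlueJob1",
    "States": {
        "RunGlueJob1": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # waits for the run to finish
            "Parameters": {"JobName": GLUE_JOB_1},
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "RunGlueJob2",
        },
        "RunGlueJob2": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": GLUE_JOB_2},
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "NotifySuccess",
        },
        "NotifySuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": TOPIC_ARN, "Message": "ETL pipeline succeeded"},
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": TOPIC_ARN, "Message": "ETL pipeline failed"},
            "Next": "FailState",
        },
        "FailState": {"Type": "Fail"},
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="etl-pipeline",  # illustrative
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/etl-sfn-role",  # illustrative
)
```

The .sync suffix is what makes Step Functions wait for the Glue job run to finish (and surface its failure) before moving to the next state.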

Step Functions - Wait until Glue Crawler is completed

In my team, we manage ETL jobs through Step Functions. Due to application requirements, we don't want to use Glue Workflows.
Most of our ETL jobs (i.e., Step Functions) are of the type:
Run Crawler on Data Source -> Execute Glue Job -> Run Crawler on Data Target
Now, I know that I can run AWS Glue jobs with .sync (ref), but I can't for Glue Crawlers. My question is: how do I make a Step Function wait until the Crawler is done?
I thought about two solutions:
A dedicated Lambda periodically checks Crawler state. This is highly inefficient.
The Step Function waits for a CloudWatch event about a change in Crawler state (i.e., "Succeeded" or "Failed"). The issue is that I don't know how to implement this.
You can use EventBridge for that. EventBridge supports an event on Crawler State Change, which can then trigger something in Step Functions.
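As a rough sketch of that wiring with boto3, the rule name, crawler name, state machine ARN, and role ARN below are illustrative, and the detail fields reflect my reading of the Glue Crawler State Change event:

```python
import json
import boto3

events = boto3.client("events")

# Match Succeeded/Failed state changes for one crawler (names/ARNs are illustrative).
event_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Crawler State Change"],
    "detail": {
        "crawlerName": ["my-data-source-crawler"],
        "state": ["Succeeded", "Failed"],
    },
}

events.put_rule(
    Name="crawler-state-change",
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

events.put_targets(
    Rule="crawler-state-change",
    Targets=[
        {
            "Id": "start-etl-state-machine",
            "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:etl-pipeline",
            # Role that allows EventBridge to call states:StartExecution.
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-start-sfn",
        }
    ],
)
```

With a rule like this, the crawler-completion event starts the state machine (or the downstream portion of the workflow) instead of the workflow polling the crawler state.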

How to include AWS Glue crawler in Step Function

This is my requirement:
I have a crawler and a PySpark job in AWS Glue. I have to set up the workflow using Step Functions.
Questions:
How can I add the Crawler as the first state? What are the parameters I need to provide (Resource, Type, etc.)?
How do I make sure that the next state, the PySpark job, starts only once the crawler has run successfully?
Is there any way I can schedule the Step Function state machine to run at a particular time?
References:
Manage AWS Glue Jobs with Step Functions
A few months late to answer this, but this can be done from within the Step Function.
You can create the following states to achieve it:
TriggerCrawler: Task State: Triggers a Lambda function; within this Lambda function you can write code to trigger the AWS Glue Crawler using the AWS SDK.
PollCrawlerStatus: Task State: A Lambda function that polls the Crawler status and returns it as the Lambda's response.
IsCrawlerRunSuccessful: Choice State: Based on the status of the Glue Crawler, this state either transitions to the state that triggers your Glue job (once the Glue Crawler state is 'READY') or to the Wait state for a few seconds before the status is polled again.
RunGlueJob: Task State: A Lambda function that triggers the Glue job.
WaitForCrawler: Wait State: Waits for 'n' seconds before the status is polled again.
Finish: Succeed State.
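As a rough sketch of the two Lambda functions this flow relies on (the crawler name and return shapes are illustrative, not from the original answer):

```python
import boto3

glue = boto3.client("glue")
CRAWLER_NAME = "my-crawler"  # illustrative


def trigger_crawler_handler(event, context):
    """TriggerCrawler task: start the crawler."""
    glue.start_crawler(Name=CRAWLER_NAME)
    return {"crawler": CRAWLER_NAME, "started": True}


def poll_crawler_status_handler(event, context):
    """PollCrawlerStatus task: return the crawler state and last crawl status.

    The crawler returns to 'READY' once a crawl has finished; LastCrawl.Status
    then tells you whether that crawl SUCCEEDED or FAILED.
    """
    crawler = glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]
    last_crawl = crawler.get("LastCrawl", {})
    return {
        "state": crawler["State"],                      # READY | RUNNING | STOPPING
        "last_crawl_status": last_crawl.get("Status"),  # SUCCEEDED | FAILED | CANCELLED
    }
```

The RunGlueJob Lambda would similarly call glue.start_job_run for the PySpark job.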

Running AWS Glue jobs in parallel

I have 30 Glue jobs that I want to run in parallel. If one job fails, the others must continue. I started with Step Functions, creating a state machine that executes a runner Lambda function, which in turn triggers a Glue job depending on a parameter (the name of the Glue job). For one job there is a decent amount of Step Functions logic implemented (retry, error handling, etc.).
Is there any way to execute a state machine from another state machine? That way I could have 30 parallel tasks that each execute another state machine. If you have any suggestions, please feel free to share.
AWS recommends using SNS for a fan-out architecture to run parallel jobs from a single S3 event, since you get an overlapping-configuration error if two Lambdas try to subscribe to the same S3 event.
You basically send the S3 event to SNS and subscribe your 30 Lambdas, so they all trigger from the SNS notification (containing the details of the S3 event) when it is published.
Create the Topic
Update the Topic Policy to allow Event Notifications from an S3 Bucket
Configure the S3 Bucket to send Event Notifications to the SNS Topic
Create the parallel Lambda functions, one for each job
Modify the Lambda functions to process SNS messages wrapping S3 event notifications instead of the S3 event itself (see the sketch after the links below)
https://aws.amazon.com/blogs/compute/fanout-s3-event-notifications-to-multiple-endpoints/
There is also another nice example with CloudFormation template https://aws.amazon.com/blogs/compute/messaging-fanout-pattern-for-serverless-architectures-using-amazon-sns/
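For the last step in the list above (processing SNS messages that wrap S3 event notifications), here is a minimal sketch of one of the parallel Lambda handlers; the Glue job name and job arguments are illustrative:

```python
import json
import boto3

glue = boto3.client("glue")
GLUE_JOB_NAME = "my-parallel-job-01"  # illustrative; each of the 30 Lambdas would use its own job


def handler(event, context):
    """SNS-triggered Lambda: the S3 event notification is JSON inside the SNS message body."""
    for record in event["Records"]:
        s3_event = json.loads(record["Sns"]["Message"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            run = glue.start_job_run(
                JobName=GLUE_JOB_NAME,
                Arguments={"--input_bucket": bucket, "--input_key": key},  # illustrative job args
            )
            print(f"Started {GLUE_JOB_NAME} for s3://{bucket}/{key}: {run['JobRunId']}")
```

Because each of the 30 Lambdas starts its own Glue job independently, a failure in one job does not stop the others, which matches the requirement in the question.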