In my team, we manage ETL jobs through Step Functions. As app requirements, we don't want to use Glue Workflows.
Most of our ETL jobs (i.e., step functions) are of the type:
Run Crawler on Data Source -> Execute Glue Job -> Run Crawler on Data Target
Now, I know that I can run .synch for AWS Glue jobs (ref), but I can't on Glue Crawlers. My question is: how do I make wait a Step Function until Crawler is done?
I thought about two solutions:
A dedicated Lambda periodically checks Crawler state. This is highly inefficient.
Step function waits for a CloudWatch event about change on Crawler state (i.e., "Succeed" or "Failed". The issue is I don't know how to implement this.
You can use EventBridge for that. EventBridge supports an event on Crawler State Change which then can trigger something in Step Functions.
Related
Currently, we have the following AWS setup for executing Glue jobs. An S3 event triggers a lambda function execution whose python logic triggers 10 AWS Glue jobs.
S3 -> Trigger -> Lambda -> 1 or more Glue Jobs.
With this setup, we see that at a time, multiple different Glue jobs run in parallel. How can I make it so that at any point in time, only one Glue job runs? And any Glue jobs sent for execution wait in a queue until the currently running Glue job is finished?
You can use step function and in each steps specify job you want to run so you will have control to run jobs and once step one complete then call step 2 jobs etc
If you are looking for having some job queues to have the Glue jobs trigger in sequence, you may consider using a combination of SQS->lambda->Glue jobs? Please refer this SO for details
AWS Step function is also another option as suggested by Vaquar Khan
I have a lambda which triggers 3-5 glue jobs dynamically depending on input provided by a configuration file. Now, I have to trigger the second lambda only when all the glue jobs triggered by the previous lambda were successful else keep on waiting and if either of them failed fail the entire process and notify.
I'm trying to create the Architecture on AWS where a lambda function runs SQL Code to refresh a materialized view on AWS Redshift. I would like the materialized view to refresh after the daily ETL processes have completed on the Redshift cluster. Is there a way of setting up the lambda function to be triggered after a particular SQL command on the Redshift Cluster has completed?
Unfortunately, I've only seen examples of people scheduling the Lambda Function to run on particular intervals/at a particular time. Any help would be much appreciated.
A couple of ways that this can be done (out of many):
Have the ETL process trigger the Lambda - this is straight forward
if the ETL tool can generate the trigger but organizational factors
can make changing ETL frameworks difficult.
Use an S3 semaphore - have your ETL SQL UNLOAD some small data (like
a text string of metadata) to S3 where the objects creation will
trigger the Lambda. Insert the UNLOAD at the point in the ETL SQL
where you want the update to occur.
This is my requirement:
I have a crawler and a pyspark job in AWS Glue. I have to setup the workflow using step function.
Questions:
How can I add Crawler as the first state. What are the parameters I need to provide(Resource,Type etc).
How to make sure that the next state - Pyspark job starts only once the crawler ran successfully.
Is there any way I can schedule the Step Function State Machine to run at a particular time?
References:
Manage AWS Glue Jobs with Step Functions
A few months late to answer this but this can be done from within the step function.
You can create the following states to achieve it:
TriggerCrawler: Task State: Triggers a Lambda function, within this lambda function you can write code for triggering AWS Glue Crawler using any of the aws-sdk
PollCrawlerStatus: Task state: Lambda function that polls for Crawler status and returns it as a response of lambda.
IsCrawlerRunSuccessful: Choice State: Based on that status of Glue crawler you can make Next state to be a Choice state which will either go to the next state that triggers yours Glue job (once the Glue crawler state is 'READY') or go to the Wait State for few seconds before you poll for it again.
RunGlueJob: Task State: A Lambda function that triggers the glue job.
WaitForCrawler: Wait State: That waits for 'n' seconds before you poll for status again.
Finish: Succeed State.
Here is how this Step Function will look like:
I'm trying to figure out how to automatically kick off an AWS Glue Job when an AWS Glue Crawler completes. I see that the Crawlers send events when they complete, but I'm struggling to parse through the documentation to figure out how to listen to that event and then launch the AWS Glue Job.
This seems like a fairly simple question, but I haven't been able to find any leads so far. I'd appreciate some help. Thanks in advance!
You can create a CloudWatch event, choose Glue Crawler state change as Event source, choose a Lambda function as Event target, and in the Lambda function you can use boto3(or other language sdk) to invoke the job to run.
Use a AWS Glue Trigger.
For anything involving more than two steps, I'd recommend using AWS Glue Workflows. They are formed by chaining Glue jobs, crawlers and triggers together into a workflow that can be visualised and monitored easily.