Currently, we have the following AWS setup for executing Glue jobs. An S3 event triggers a Lambda function whose Python logic triggers 10 AWS Glue jobs.
S3 -> Trigger -> Lambda -> 1 or more Glue Jobs.
With this setup, multiple different Glue jobs end up running in parallel. How can I make it so that only one Glue job runs at any point in time, and any Glue jobs submitted for execution wait in a queue until the currently running Glue job is finished?
You can use a Step Function and specify in each step the job you want to run, so you have control over the order: once the step one job completes, the step two job is called, and so on.
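As a rough illustration of this approach (a sketch only, assuming boto3; the job names, state machine name and role ARN below are placeholders), you could generate a state machine that chains the jobs with the `glue:startJobRun.sync` integration, so each state waits for its Glue job to finish before starting the next one:

```python
# Sketch only: build a Step Functions state machine that runs Glue jobs
# strictly one after another. Job names, the state machine name and the
# role ARN are placeholders.
import json
import boto3

job_names = ["glue-job-1", "glue-job-2", "glue-job-3"]  # your jobs, in order

states = {}
for i, job in enumerate(job_names):
    state = {
        "Type": "Task",
        # The .sync integration makes the state wait until the job run finishes.
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job},
    }
    if i == len(job_names) - 1:
        state["End"] = True
    else:
        state["Next"] = f"Run-{job_names[i + 1]}"
    states[f"Run-{job}"] = state

definition = {
    "Comment": "Run Glue jobs one at a time",
    "StartAt": f"Run-{job_names[0]}",
    "States": states,
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="sequential-glue-jobs",                             # placeholder
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-glue-role",  # placeholder
)
```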
If you are looking for a job queue so that the Glue jobs trigger in sequence, you may consider using a combination of SQS -> Lambda -> Glue jobs. Please refer to this SO answer for details.
AWS Step Functions is another option, as suggested by Vaquar Khan.
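If you go the SQS route, a minimal sketch of the Lambda consumer could look like the one below (assumptions: boto3, an SQS event source with batch size 1 and reserved concurrency 1, and each message body carrying the name of the Glue job to run; the job names are placeholders):

```python
# Sketch of the SQS -> Lambda -> Glue approach.
import boto3

glue = boto3.client("glue")

RUNNING_STATES = {"STARTING", "RUNNING", "STOPPING"}

def any_job_running(job_names):
    """Return True if any of the listed Glue jobs has an active run."""
    for name in job_names:
        runs = glue.get_job_runs(JobName=name, MaxResults=5)["JobRuns"]
        if any(r["JobRunState"] in RUNNING_STATES for r in runs):
            return True
    return False

def handler(event, context):
    # Placeholder list of all jobs that must never overlap.
    all_jobs = ["glue-job-1", "glue-job-2", "glue-job-3"]

    for record in event["Records"]:
        job_to_start = record["body"]
        if any_job_running(all_jobs):
            # Raising puts the message back on the queue after the
            # visibility timeout, so it is retried later.
            raise RuntimeError("A Glue job is still running; retry later")
        glue.start_job_run(JobName=job_to_start)
```

With reserved concurrency of 1 on the Lambda, only one message is processed at a time, and raising an exception returns the message to the queue so it is retried once the running job has finished.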
Related
I have a Lambda which triggers 3-5 Glue jobs dynamically, depending on input provided by a configuration file. Now, I have to trigger the second Lambda only when all the Glue jobs triggered by the previous Lambda were successful; otherwise it should keep waiting, and if any of them fails, the entire process should fail and notify.
In my team, we manage ETL jobs through Step Functions. Due to application requirements, we don't want to use Glue Workflows.
Most of our ETL jobs (i.e., step functions) are of the type:
Run Crawler on Data Source -> Execute Glue Job -> Run Crawler on Data Target
Now, I know that I can run .sync for AWS Glue jobs (ref), but I can't on Glue Crawlers. My question is: how do I make a Step Function wait until the Crawler is done?
I thought about two solutions:
A dedicated Lambda periodically checks Crawler state. This is highly inefficient.
The Step Function waits for a CloudWatch event about a change in Crawler state (i.e., "Succeeded" or "Failed"). The issue is I don't know how to implement this.
You can use EventBridge for that. EventBridge supports an event on Crawler State Change which then can trigger something in Step Functions.
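One way to implement that is the Step Functions task-token callback pattern: the state that starts the Crawler waits for a token, and a Lambda invoked by the EventBridge rule completes it. A minimal sketch, assuming boto3; the SSM parameter path, the state-input field names (`crawler_name`, `task_token`) and the exact shape of the EventBridge event detail are assumptions:

```python
# Sketch of the task-token callback pattern. Assumes:
#  - a Task state with Resource "arn:aws:states:::lambda:invoke.waitForTaskToken"
#    that invokes start_crawler_handler and passes the token in its Parameters
#  - an EventBridge rule on "Glue Crawler State Change" that invokes
#    crawler_state_handler.
import json
import boto3

glue = boto3.client("glue")
ssm = boto3.client("ssm")
sfn = boto3.client("stepfunctions")

def start_crawler_handler(event, context):
    """Start the crawler and stash the task token so the callback can find it."""
    crawler = event["crawler_name"]
    ssm.put_parameter(
        Name=f"/crawler-tokens/{crawler}",   # assumed naming convention
        Value=event["task_token"],           # injected via the Task state's Parameters
        Type="String",
        Overwrite=True,
    )
    glue.start_crawler(Name=crawler)

def crawler_state_handler(event, context):
    """Invoked by the EventBridge rule when the crawler changes state."""
    crawler = event["detail"]["crawlerName"]   # field names assumed
    state = event["detail"]["state"]
    token = ssm.get_parameter(Name=f"/crawler-tokens/{crawler}")["Parameter"]["Value"]
    if state == "Succeeded":
        sfn.send_task_success(taskToken=token, output=json.dumps({"crawler": crawler}))
    else:
        sfn.send_task_failure(taskToken=token, error="CrawlerFailed", cause=state)
```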
I am using the AWS Glue service with two separate workflows (let's say workflow A and workflow B).
I have created a conditional-type trigger in workflow B that watches jobs in workflow A and supposedly fires when they succeed. Can this trigger actually fire if it watches jobs from workflow A (i.e. a different workflow)?
I have tested this a few times, but the jobs in workflow B that are supposed to be triggered by this specific trigger don't run, despite all of the watched jobs succeeding.
I can't seem to find any information on this specific AWS Glue setup.
From the AWS docs it seems that triggers that start a workflow must be of one of the following types:
Schedule
On demand
EventBridge event
Link: https://docs.aws.amazon.com/glue/latest/dg/workflows_overview.html
A solution to my problem might be to omit Workflows entirely and just create triggers and jobs.
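As a sketch of that workaround (assuming boto3; the trigger and job names are placeholders), a standalone conditional trigger can be created outside any workflow like this:

```python
# Sketch: a standalone conditional trigger (no workflow) that starts a
# downstream job once two upstream jobs have succeeded. Names are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="start-b-after-a",                  # placeholder
    Type="CONDITIONAL",
    StartOnCreation=True,                    # activate immediately
    Predicate={
        "Logical": "AND",                    # fire only when all conditions hold
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": "workflow-a-job-1", "State": "SUCCEEDED"},
            {"LogicalOperator": "EQUALS", "JobName": "workflow-a-job-2", "State": "SUCCEEDED"},
        ],
    },
    Actions=[{"JobName": "workflow-b-job"}], # job to start
)
```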
I have an AWS glue job with Spark UI enabled by following this instruction: Enabling the Spark UI for Jobs
The glue job has s3:* access to arn:aws:s3:::my-spark-event-bucket/* resource. But for some reason, when I run the glue job (and it successfully finished within 40-50 seconds and successfully generated the output parquet files), it doesn't generate any spark event logs to the destination s3 path. I wonder what could have gone wrong and if there is any systematic way for me to pinpoint the root cause.
How long is your Glue job running for?
I found that jobs with short execution times, less than or around 1 minute, do not reliably produce Spark UI logs in S3.
The AWS documentation states: "Every 30 seconds, AWS Glue flushes the Spark event logs to the Amazon S3 path that you specify." The reason short jobs do not produce Spark UI logs probably has something to do with this.
If you have a job with a short execution time, try adding additional steps to the job, or even a pause/wait, to lengthen the execution time. This should help ensure that the Spark UI logs are sent to S3.
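For example, a crude way to pad a very short job at the end of its script (purely a diagnostic workaround; the 90-second figure is an arbitrary choice, not from the docs):

```python
# Pad a short Glue job so it runs past the ~30-second flush interval
# for Spark event logs.
import time

# ... your existing Glue/Spark logic here ...

# Keep the job alive a bit longer so pending Spark event logs get
# flushed to the configured S3 path before the job exits.
time.sleep(90)
```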
This is my requirement:
I have a crawler and a pyspark job in AWS Glue. I have to setup the workflow using step function.
Questions:
How can I add the Crawler as the first state? What are the parameters I need to provide (Resource, Type, etc.)?
How do I make sure that the next state (the PySpark job) starts only once the crawler has run successfully?
Is there any way I can schedule the Step Function State Machine to run at a particular time?
References:
Manage AWS Glue Jobs with Step Functions
A few months late to answer this, but this can be done from within the Step Function.
You can create the following states to achieve it:
TriggerCrawler: Task State: triggers a Lambda function; within this Lambda function you can write code to start the AWS Glue Crawler using any of the AWS SDKs.
PollCrawlerStatus: Task State: a Lambda function that polls the Crawler status and returns it as the Lambda's response.
IsCrawlerRunSuccessful: Choice State: based on the status of the Glue Crawler, this state either goes on to the state that triggers your Glue job (once the crawler state is 'READY') or goes to the Wait State for a few seconds before polling again.
RunGlueJob: Task State: a Lambda function that triggers the Glue job.
WaitForCrawler: Wait State: waits for 'n' seconds before the status is polled again.
Finish: Succeed State.
Here is what this Step Function will look like:
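For reference, a minimal sketch of the Lambda code behind the TriggerCrawler and PollCrawlerStatus states (assuming boto3; the `crawler_name` input field is an illustrative assumption):

```python
# Sketch of the Lambdas behind the TriggerCrawler and PollCrawlerStatus
# task states. The crawler name comes from the state input.
import boto3

glue = boto3.client("glue")

def trigger_crawler_handler(event, context):
    """TriggerCrawler: start the crawler named in the state input."""
    glue.start_crawler(Name=event["crawler_name"])
    return {"crawler_name": event["crawler_name"]}

def poll_crawler_status_handler(event, context):
    """PollCrawlerStatus: return the crawler's current state ('READY',
    'RUNNING' or 'STOPPING') so the Choice state can branch on it."""
    crawler = glue.get_crawler(Name=event["crawler_name"])["Crawler"]
    return {
        "crawler_name": event["crawler_name"],
        "crawler_state": crawler["State"],
    }
```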