This is my requirement:
I have a crawler and a pyspark job in AWS Glue. I have to setup the workflow using step function.
Questions:
How can I add Crawler as the first state. What are the parameters I need to provide(Resource,Type etc).
How to make sure that the next state - Pyspark job starts only once the crawler ran successfully.
Is there any way I can schedule the Step Function State Machine to run at a particular time?
References:
Manage AWS Glue Jobs with Step Functions
A few months late to answer this but this can be done from within the step function.
You can create the following states to achieve it:
TriggerCrawler: Task State: Triggers a Lambda function, within this lambda function you can write code for triggering AWS Glue Crawler using any of the aws-sdk
PollCrawlerStatus: Task state: Lambda function that polls for Crawler status and returns it as a response of lambda.
IsCrawlerRunSuccessful: Choice State: Based on that status of Glue crawler you can make Next state to be a Choice state which will either go to the next state that triggers yours Glue job (once the Glue crawler state is 'READY') or go to the Wait State for few seconds before you poll for it again.
RunGlueJob: Task State: A Lambda function that triggers the glue job.
WaitForCrawler: Wait State: That waits for 'n' seconds before you poll for status again.
Finish: Succeed State.
Here is how this Step Function will look like:
Related
Currently, we have the following AWS setup for executing Glue jobs. An S3 event triggers a lambda function execution whose python logic triggers 10 AWS Glue jobs.
S3 -> Trigger -> Lambda -> 1 or more Glue Jobs.
With this setup, we see that at a time, multiple different Glue jobs run in parallel. How can I make it so that at any point in time, only one Glue job runs? And any Glue jobs sent for execution wait in a queue until the currently running Glue job is finished?
You can use step function and in each steps specify job you want to run so you will have control to run jobs and once step one complete then call step 2 jobs etc
If you are looking for having some job queues to have the Glue jobs trigger in sequence, you may consider using a combination of SQS->lambda->Glue jobs? Please refer this SO for details
AWS Step function is also another option as suggested by Vaquar Khan
I have a lambda which triggers 3-5 glue jobs dynamically depending on input provided by a configuration file. Now, I have to trigger the second lambda only when all the glue jobs triggered by the previous lambda were successful else keep on waiting and if either of them failed fail the entire process and notify.
We have a simple ETL setup below
Vendor upload crawled parquet data to our S3 bucket.
S3 event trigger a lambda function, which will trigger a glue crawler to update the existing table partition in glue.
This works fine most of the times, but in some cases our vendor might upload files consecutively in a short time period, for example when refreshing history data. This will cause an issue since glue crawler cannot run concurrently and the job will fail.
I'm wondering if there is anything we can do to avoid the potential error. I've looked into SQS but not exactly sure if this can help me, below is what I would like to achieve:
Vendor upload file to S3.
S3 send event to SQS.
SQS hold the event, wait until there has been no other following event for a given time period, say 5 minutes.
After no further event in 5 minutes, SQS trigger the lambda function to run the glue crawler.
Is this doable with S3 and SQS?
SQS hold the event,
Yes, you can do this, as you can setup SQS delay to up to 15 minues.
wait until there has been no other following event for a given time period, say 5 minutes.
No, there is not automated way for that. You have to develop your own custom solution. The most trivial way would be to not bundle SQS with lambda, and instead have lambda running on schedule (e.g. every 5 minutes). Lambda would have to have logic to determine if there are no new files uploaded after some time, and then trigger your Glue Job. Probably this would involve DynamoDB to keep track of last uploaded files between lambda executions.
In my team, we manage ETL jobs through Step Functions. As app requirements, we don't want to use Glue Workflows.
Most of our ETL jobs (i.e., step functions) are of the type:
Run Crawler on Data Source -> Execute Glue Job -> Run Crawler on Data Target
Now, I know that I can run .synch for AWS Glue jobs (ref), but I can't on Glue Crawlers. My question is: how do I make wait a Step Function until Crawler is done?
I thought about two solutions:
A dedicated Lambda periodically checks Crawler state. This is highly inefficient.
Step function waits for a CloudWatch event about change on Crawler state (i.e., "Succeed" or "Failed". The issue is I don't know how to implement this.
You can use EventBridge for that. EventBridge supports an event on Crawler State Change which then can trigger something in Step Functions.
I am new to AWS world and I am trying to implement a process where data written into S3 by AWS EMR can be loaded into AWS Redshift. I am using terraform to create S3 and Redshift and other supported functionality. For loading data I am using lambda function which gets triggered when the redshift cluster is up . The lambda function has the code to copy the data from S3 to redshift. Currently the process seams to work fine .The amount of data is currently low
My question is
This approach seems to work right now but I don't know how it will work once the volume of data increases and what if lambda functions times out
can someone please suggest me any alternate way of handling this scenario even if it can be handled without lambda .One alternate I came across searching for this topic is AWS data pipeline.
Thank you
A server-less approach I've recommended clients move to in this case is Redshift Data API (and Step Functions if needed). With the Redshift Data API you can launch a SQL command (COPY) and close your Lambda function. The COPY command will run to completion and if this is all you need to do then your done.
If you need to take additional actions after the COPY then you need a polling Lambda that checks to see when the COPY completes. This is enabled by Redshift Data API. Once COPY completes you can start another Lambda to run the additional actions. All these Lambdas and their interactions are orchestrated by a Step Function that:
launches the first Lambda (initiates the COPY)
has a wait loop that calls the "status checker" Lambda every 30 sec (or whatever interval you want) and keeps looping until the checker says that the COPY completed successfully
Once the status checker lambda says the COPY is complete the step function launches the additional actions Lambda
The Step function is an action sequencer and the Lambdas are the actions. There are a number of frameworks that can set up the Lambdas and Step Function as one unit.
With bigger datasets, as you already know, Lambda may time out. But 15 minutes is still a lot of time, so you can implement alternative solution meanwhile.
I wouldn't recommend data pipeline as it might be an overhead (It will start an EC2 instance to run your commands). Your problem is simply time out, so you may use either ECS Fargate, or Glue Python Shell Job. Either of them can be triggered by Cloudwatch Event triggered on an S3 event.
a. Using ECS Fargate, you'll have to take care of docker image and setup ECS infrastructure i.e. Task Definition, Cluster (simple for Fargate).
b. Using Glue Python Shell job you'll simply have to deploy your python script in S3 (along with the required packages as wheel files), and link those files in the job configuration.
Both of these options are serverless and you may chose one based on ease of deployment and your comfort level with docker.
ECS doesn't have any timeout limits, while timeout limit for Glue is 2 days.
Note: To trigger AWS Glue job from Cloudwatch Event, you'll have to use a Lambda function, as Cloudwatch Event doesn't support Glue start job yet.
Reference: https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutTargets.html