Passing parameters to Glue Job using Step Function - amazon-web-services

I have a Step Functions state machine that is supposed to run my Glue jobs one after another. EventBridge passes in the list of jobs to run along with their arguments, but when I look at Glue, the jobs are all running at the same time.
{
"Comment": "A description of my state machine",
"StartAt": "Pass",
"States": {
"Pass": {
"Type": "Pass",
"Next": "Map"
},
"Map": {
"Type": "Map",
"Iterator": {
"StartAt": "Glue StartJobRun_1",
"States": {
"Glue StartJobRun_1": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName.$": "$.job_name",
"Arguments.$": "$.Arguments"
},
"End": true
}
}
},
"ItemsPath": "$.detail.config",
"End": true
}
}
}
The first Glue job should finish before the next one starts. Can you suggest how I can run them in sequence?
{
"config": [
{
"job_name": "dev_1",
"Arguments": {
"--environment": "dev"
}
},
{
"job_name": "dev_2",
"Arguments": {
"--environment": "dev"
}
}
]
}

The Map state in your Step Functions workflow takes the input array and runs the states in its iterator in parallel. By default there is no explicit concurrency limit (an inline Map state runs up to 40 iterations concurrently).
To execute the Glue jobs in sequence, add "MaxConcurrency": 1 to the Map state. Items in the array are then processed one at a time, in order of appearance, and because the task uses the .sync integration, each iteration waits for its Glue job to finish.
Here's the modified Step Functions workflow definition:
{
"Comment": "A description of my state machine",
"StartAt": "Pass",
"States": {
"Pass": {
"Type": "Pass",
"Next": "Map"
},
"Map": {
"Type": "Map",
"Iterator": {
"StartAt": "Glue StartJobRun_1",
"States": {
"Glue StartJobRun_1": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName.$": "$.job_name",
"Arguments.$": "$.Arguments"
},
"End": true
}
}
},
"ItemsPath": "$.detail.config",
"End": true,
"MaxConcurrency": 1
}
}
}
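If the definition is generated or patched by a script, the same change can be applied programmatically. The following is a minimal sketch, not part of the original answer: the helper name force_sequential_maps is hypothetical, and it assumes the definition is already loaded as a Python dict.

```python
import copy
import json


def force_sequential_maps(definition: dict) -> dict:
    """Return a copy of an ASL definition with MaxConcurrency set to 1 on every Map state."""
    result = copy.deepcopy(definition)

    def visit(states: dict) -> None:
        for state in states.values():
            if not isinstance(state, dict):
                continue
            if state.get("Type") == "Map":
                state["MaxConcurrency"] = 1
                # This thread uses "Iterator"; newer definitions use "ItemProcessor".
                for key in ("Iterator", "ItemProcessor"):
                    if key in state:
                        visit(state[key].get("States", {}))
            elif state.get("Type") == "Parallel":
                for branch in state.get("Branches", []):
                    visit(branch.get("States", {}))

    visit(result.get("States", {}))
    return result


# Trimmed-down version of the state machine from the question.
definition = {
    "StartAt": "Map",
    "States": {
        "Map": {
            "Type": "Map",
            "ItemsPath": "$.detail.config",
            "Iterator": {
                "StartAt": "Glue StartJobRun_1",
                "States": {
                    "Glue StartJobRun_1": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::glue:startJobRun.sync",
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

patched = force_sequential_maps(definition)
print(json.dumps(patched["States"]["Map"], indent=2))
```

The original dict is left untouched, so the patched copy can be compared against it before deploying.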

Related

AWS Step Functions Consuming messages from SQS

I am consuming messages from SQS to trigger queries.
When I normally consume a message from SQS in Python, I need to delete the message from SQS.
Do I have to manually delete the message from SQS in a Step Function?
What is the best/simplest way to do so?
I believe SQS has done the integration:
{
"Comment": "Run Redshift Queries",
"StartAt": "ReceiveMessage from SQS",
"States": {
"ReceiveMessage from SQS": {
"Type": "Task",
"Parameters": {
"QueueUrl": "******"
},
"Resource": "arn:aws:states:::aws-sdk:sqs:receiveMessage",
"Next": "Run Analysis Queries",
"ResultSelector": {
"body.$": "States.StringToJson($.Messages[0].Body)"
}
},
"Run Analysis Queries": {
"Type": "Task",
"Parameters": {
"ClusterIdentifier": "******",
"Database": "prod",
"Sql": "select * from ******"
},
"Resource": "arn:aws:states:::aws-sdk:redshiftdata:executeStatement",
"End": true
}
},
"TimeoutSeconds": 3600
}
I just ran a test, and it seems the message count goes down temporarily but then goes back up.
Is the best approach to insert a Lambda between the "ReceiveMessage from SQS" state and the Redshift state?
This raises another question. So far I have only run this manually. How do I eventually trigger this Step Function on every incoming message?
If you must use SQS, you will need a Lambda function to act as a proxy. Set up the queue as the Lambda's trigger, and write the function so that it parses the SQS message and makes the appropriate call to the Step Functions StartExecution API.
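A minimal sketch of such a proxy is shown here, under stated assumptions: the environment variable name STATE_MACHINE_ARN and the ARN value are placeholders, each message body is assumed to be JSON, and the RecordingClient class exists only so the handler can be dry-run locally without AWS credentials. In a deployed function, the boto3 Step Functions client's start_execution call would be used instead.

```python
import json
import os


def handler(event, context, sfn_client=None):
    """Start one Step Functions execution per SQS record in the Lambda event."""
    if sfn_client is None:
        import boto3  # imported lazily so the sketch can be dry-run without AWS

        sfn_client = boto3.client("stepfunctions")

    state_machine_arn = os.environ["STATE_MACHINE_ARN"]  # hypothetical variable name
    started = []
    for record in event.get("Records", []):
        payload = json.loads(record["body"])  # assumes the message body is JSON
        response = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps(payload),
        )
        started.append(response["executionArn"])
    return {"started": started}


class RecordingClient:
    """Stand-in for the boto3 client, used only for the local dry run below."""

    def __init__(self):
        self.calls = []

    def start_execution(self, stateMachineArn, input):
        self.calls.append((stateMachineArn, input))
        return {"executionArn": f"{stateMachineArn}:run-{len(self.calls)}"}


# Placeholder ARN for the dry run; replace with the real state machine ARN.
os.environ["STATE_MACHINE_ARN"] = "arn:aws:states:us-east-1:123456789012:stateMachine:Example"
fake = RecordingClient()
result = handler({"Records": [{"body": '{"query": "daily"}'}]}, None, sfn_client=fake)
print(result)
```

When Lambda is triggered by the queue and the handler returns without error, Lambda removes the processed messages from the queue, so no explicit delete call is needed in this variant.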
After you consume a message, you have to delete it using sqs:deleteMessage. The message reappears in the queue because, once an application reads it, it is only hidden for the queue's visibility timeout (30 seconds by default) so that other consumers do not process it at the same time. If it is not deleted before the timeout expires, it becomes visible again.
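The reappearing message can be illustrated with a toy model. This is only a sketch of the visibility-timeout idea: TinyQueue is a hypothetical in-memory class, not the SQS API, and it simplifies details such as receipt handles, which real SQS reissues on every receive.

```python
class TinyQueue:
    """Toy in-memory model of a queue with a visibility timeout (not the real service)."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # handle -> [body, time at which it becomes visible again]
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        handle = f"h{self._next_id}"
        self._messages[handle] = [body, 0.0]
        return handle

    def receive(self, now):
        """Return (handle, body) for the first visible message and hide it, or None."""
        for handle, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return handle, entry[0]
        return None

    def delete(self, handle):
        self._messages.pop(handle, None)

    def visible_count(self, now):
        return sum(1 for _, until in self._messages.values() if until <= now)


queue = TinyQueue(visibility_timeout=30.0)
queue.send("run analysis query")

first = queue.receive(now=0.0)                 # the message is read and hidden
hidden = queue.visible_count(now=10.0)         # still hidden at t=10
reappeared = queue.visible_count(now=31.0)     # never deleted, so it is visible again

second = queue.receive(now=31.0)
queue.delete(second[0])                        # deleting is what removes it for good
after_delete = queue.visible_count(now=100.0)
print(hidden, reappeared, after_delete)
```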
Here is an example of how to read, process, and delete a message from the queue. Note that I set MaxNumberOfMessages to 1 and gave the Redshift state a ResultPath other than $, so the state input is not overwritten by the query result.
"ReceiveMessage from SQS": {
"Type": "Task",
"Parameters": {
"MaxNumberOfMessages": 1,
"QueueUrl": "******"
},
"Resource": "arn:aws:states:::aws-sdk:sqs:receiveMessage",
"Next": "Run Analysis Queries",
"ResultSelector": {
"body.$": "States.StringToJson($.Messages[0].Body)",
"receiptHandle.$": "$.Messages[0].ReceiptHandle"
}
},
"Run Analysis Queries": {
"Type": "Task",
"Parameters": {
"ClusterIdentifier": "******",
"Database": "prod",
"Sql": "select * from ******"
},
"Resource": "arn:aws:states:::aws-sdk:redshiftdata:executeStatement",
"ResultPath": "$.redshift_output",
"Next": "delete_sqs"
},
"delete_sqs": {
"Comment": "Deletes SQS message",
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:sqs:deleteMessage",
"Parameters": {
"ReceiptHandle.$": "$.receiptHandle",
"QueueUrl": "******"
},
"ResultPath": null,
"Next": "update_result"
}
You can also read up to 10 messages at a time by setting MaxNumberOfMessages to 10 and processing them with a Map state, as in this example:
{
"StartAt": "read_sqs",
"States": {
"read_sqs": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:sqs:receiveMessage",
"Parameters": {
"MaxNumberOfMessages": 10,
"QueueUrl": "*******"
},
"ResultPath": "$.queueResponse",
"Next": "check_results"
},
"check_results": {
"Comment": "Checking if queue is empty",
"Type": "Choice",
"Choices": [
{
"Variable": "$.queueResponse.Messages[0]",
"IsPresent": true,
"Next": "map_results"
}
],
"Default": "exit"
},
"map_results": {
"Comment": "Performs a 'map' operation over each payload",
"Type": "Map",
"ItemsPath": "$.queueResponse.Messages",
"MaxConcurrency": 10,
"Iterator": {
"StartAt": "read_request",
"States": {
"read_request": {
"Comment": "Parses and moves the request body into the response",
"Type": "Pass",
"Parameters": {
"requestBody.$": "States.StringToJson($.Body)"
},
"ResultPath": "$.map_response",
"Next": "Run Analysis Queries"
},
"Run Analysis Queries": {
"Type": "Task",
"Parameters": {
"ClusterIdentifier": "******",
"Database": "prod",
"Sql": "select * from ******"
},
"Resource": "arn:aws:states:::aws-sdk:redshiftdata:executeStatement",
"ResultPath": "$.redshift_output",
"Next": "delete_sqs"
},
"delete_sqs": {
"Comment": "Deletes SQS message",
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:sqs:deleteMessage",
"Parameters": {
"ReceiptHandle.$": "$.ReceiptHandle",
"QueueUrl": "*******"
},
"ResultPath": null,
"End": true
}
}
},
"ResultPath": "$.flowResponse",
"Next": "exit"
},
"exit": {
"Type": "Pass",
"End": true
}
}
}

AWS Step function error : There are Amazon States Language errors in your state machine definition. Fix the errors to continue

I'm new to AWS step functions.
I'm trying to create a basic ETL flow of Glue jobs. After completing the state machine definition I can see the graph being generated, but I get a generic error, "There are Amazon States Language errors in your state machine definition. Fix the errors to continue", which does not allow me to proceed.
Here is the code and graph:
{
"Comment": "DRC downstream glue jobs execution step function:slf_aws_can_dbisdel_everyone_drc_amp",
"StartAt": "startFlow",
"States": {
"Comment": "various state types of the Amazon States Language",
"startFlow": {
"Comment": "Pass states are useful when constructing and debugging state machines.",
"Type": "Pass",
"Next": "stg_ods"
},
"stg_ods": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "stage_job_name"
},
"Next": "ods_job"
},
"ods_job": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "main_job_name"
},
"Next": "Wait 3 sec"
},
"Wait 3 sec": {
"Comment": "A Wait state delays the state machine from continuing for a specified time.",
"Type": "Wait",
"Seconds": 3,
"Next": "parallel_stg_adr"
},
"parallel_stg_adr": {
"Comment": "A Parallel state can be used to create parallel branches of execution in your state machine.",
"Type": "Parallel",
"Branches": [
{
"StartAt": "stg_job1",
"States": {
"stg_job1": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "stg_job_name1"
},
"End": true
}
}
},
{
"StartAt": "stg_job2",
"States": {
"stg_job2": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "stg_job_name2"
},
"End": true
}
}
}
],
"Next": "parallel_adr_job"
},
"parallel_adr_job": {
"Comment": "A Parallel state can be used to create parallel branches of execution in your state machine.",
"Type": "Parallel",
"Branches": [
{
"StartAt": "job1",
"States": {
"job1": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "some_glue_job",
"Arguments": {
"--target_table": "some_string_table",
"--calendar_key": "some_string"
}
},
"End": true
}
}
},
{
"StartAt": "job2",
"States": {
"job2": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "some_glue_job",
"Arguments": {
"--target_table": "some_string_table",
"--calendar_key": "some_string"
}
},
"End": true
}
}
}
],
"Next": "end_job"
},
"end_job": {
"Type": "Pass",
"End": true
}
}
}
Step function graph
"Comment": "various state types of the Amazon States Language",
This entry on line 5 is incorrect: every key in the "States" map must be the name of a state, so "States" cannot contain a "Comment" key. Remove it and try again. The rest of the config looks correct.
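Since the console error is generic, a quick local check can help locate this kind of mistake. The sketch below is an illustration only: find_invalid_state_entries is a hypothetical helper, and it checks just the two conditions shown rather than the full Amazon States Language rules.

```python
def find_invalid_state_entries(definition: dict) -> list:
    """List entries under "States" that cannot be valid states."""
    problems = []
    for name, value in definition.get("States", {}).items():
        if not isinstance(value, dict):
            # e.g. a stray "Comment": "..." placed directly inside "States"
            problems.append(f'"{name}" is not a state object (got {type(value).__name__})')
        elif "Type" not in value:
            problems.append(f'"{name}" has no "Type" field')
    return problems


# Shortened version of the definition from the question.
broken = {
    "StartAt": "startFlow",
    "States": {
        "Comment": "various state types of the Amazon States Language",
        "startFlow": {"Type": "Pass", "End": True},
    },
}

problems = find_invalid_state_entries(broken)
for problem in problems:
    print(problem)
```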
Edit 1
If the state machine type is Express, the ".sync" integration pattern is not supported. Try changing the ARN to
"Resource": "arn:aws:states:::glue:startJobRun"
and you should be able to save your state machine. Note that without .sync the state only starts the job run and does not wait for it to finish, so you will have to find another way to wait for the Glue job to complete.
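To see which states would be affected before switching workflow types, the definition can be scanned for these resource suffixes. This is a rough sketch with a hypothetical helper name, assuming the definition is available as a Python dict; it also flags .waitForTaskToken, which Express workflows do not support either.

```python
def find_express_incompatible(states: dict) -> list:
    """Return (state name, resource) pairs that use .sync or .waitForTaskToken."""
    suffixes = (".sync", ".sync:2", ".waitForTaskToken")
    hits = []
    for name, state in states.items():
        if not isinstance(state, dict):
            continue
        resource = state.get("Resource", "")
        if isinstance(resource, str) and resource.endswith(suffixes):
            hits.append((name, resource))
        # Descend into Parallel branches and Map iterators.
        for branch in state.get("Branches", []):
            hits.extend(find_express_incompatible(branch.get("States", {})))
        for key in ("Iterator", "ItemProcessor"):
            if key in state:
                hits.extend(find_express_incompatible(state[key].get("States", {})))
    return hits


definition = {
    "StartAt": "stg_ods",
    "States": {
        "stg_ods": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "stage_job_name"},
            "Next": "fan_out",
        },
        "fan_out": {
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": "stg_job1",
                    "States": {
                        "stg_job1": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::glue:startJobRun",
                            "End": True,
                        }
                    },
                }
            ],
            "End": True,
        },
    },
}

hits = find_express_incompatible(definition["States"])
print(hits)
```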

aws step function parallel with input parameters

I am trying to use AWS Step Functions to create parallel branches of execution.
One of the parallel branches starts another state machine execution. How can I pass the input from this parallel branch to that execution?
{
"Comment": "Parallel Example.",
"StartAt": "FunWithMath",
"States": {
"FunWithMath": {
"Type": "Parallel",
"End": true,
"Branches": [
{
"StartAt": "Add", // This branch receives some JSON object as input
"States": {
"Add": {
"Type": "Task", // How do I pass the received input on to the following ARN as input?
"Resource": "arn:aws:states:::states:startExecution",
"Parameters": {
"StateMachineArn": "anotherstepfunctionarnpath"
},
"End": true
}
}
},
{
"StartAt": "Subtract",
"States": {
"Subtract": {
"Type": "Task",
"Resource": "some lambda arn here",
"End": true
}
}
}
]
}
}
}
anotherstepfunctionarnpath :
{
"Comment": "Second state machine",
"StartAt": "stage1",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters":{
"Arguments":{
"Variable1" :"???" / how to access the value of the input passed to here
}
}
}
You can use Input to pass the output of one state machine (SFN) to another:
First SFN (it calls the second SFN):
{
"Comment": "My first SFN",
"StartAt": "First SFN",
"States": {
"First SFN": {
"Type": "Task",
"ResultPath": "$.to_pass",
"Resource": "arn:aws:lambda:us-east-1:807278658150:function:test-lambda",
"Next": "Trigger Next SFN"
},
"Trigger Next SFN": {
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution",
"Parameters": {
"Input": {
"Comment.$": "$"
},
"StateMachineArn": "arn:aws:states:us-east-1:807278658150:stateMachine:MyStateMachine2"
},
"End": true
}
}
}
Second SFN (MyStateMachine2)
{
"Comment": "A Hello World example of the Amazon States Language using Pass states",
"StartAt": "Hello",
"States": {
"Hello": {
"Type": "Pass",
"Result": "Hello",
"Next": "World"
},
"World": {
"Type": "Pass",
"Result": "World",
"End": true
}
}
}
First SFN's Execution
Second SFN's Execution
Explanation
The Lambda test-lambda is returning:
{
"user": "stackoverflow",
"id": "100"
}
This is stored under to_pass because of "ResultPath": "$.to_pass". I then pass the whole state input, including to_pass, to the next state machine, MyStateMachine2, with:
"Input": {
"Comment.$": "$"
}
In the next state machine's execution you can see that the data produced by the first Lambda is received as input.
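The data flow can be traced with a small simulation. This sketch only mimics the two fields involved for this specific case: the helper names are hypothetical, and it assumes the first state machine was started with an empty input {}.

```python
import json


def apply_result_path(state_input: dict, result, field: str) -> dict:
    """Mimic "ResultPath": "$.<field>" for a single top-level field."""
    merged = dict(state_input)
    merged[field] = result
    return merged


def build_start_execution_input(state_input: dict) -> dict:
    """Mimic "Input": {"Comment.$": "$"}: the whole state input lands under "Comment"."""
    return {"Comment": state_input}


lambda_result = {"user": "stackoverflow", "id": "100"}

# After the "First SFN" task: the Lambda result is stored under to_pass.
after_first_state = apply_result_path({}, lambda_result, "to_pass")

# What "Trigger Next SFN" hands to MyStateMachine2 as its execution input.
second_sfn_input = build_start_execution_input(after_first_state)
print(json.dumps(second_sfn_input, indent=2))
```

If only the Lambda result is wanted in the second state machine, the path can be narrowed, for example to $.to_pass, instead of passing $.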

AWS Step-Function: pass a specific value from one AWS lambda to another in step function parallel state

I have the state machine below. The requirement is to have a Lambda query the DB and get all the ids. Next, I have a Parallel state that calls more than five Lambdas at once. Instead of passing all the fetched ids to every Lambda, I need to pass each Lambda its respective id.
In the state language below, the first call is DB_CALL. Let's say it returns {id1, id2, id3, id4, id5, id6}; I want to pass only id1 to First_Lambda, id2 to Second_Lambda, and so on.
The entire id object should not get passed to all Lambdas. Please suggest a way to achieve this.
{
"Comment": "Concurrent Lambda calls",
"StartAt": "StarterLambda",
"States": {
"StarterLambda": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:DB_CALL",
"Next": "ParallelCall"
},
"State": {
"ParallelCall": {
"Type": "Parallel",
"End": true,
"Branches": [
{
"StartAt": "First",
"States": {
"First": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:First_Lambda",
"TimeoutSeconds": 120,
"End": true
}
}
},
{
"StartAt": "Second",
"States": {
"Second": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:Second_Lambda",
"Retry": [ {
"ErrorEquals": ["States.TaskFailed"],
"IntervalSeconds": 1,
"MaxAttempts": 2,
"BackoffRate": 2.0
} ],
"End": true
}
}
},
{
"StartAt": "Third",
"States": {
"Third": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:Third_Lambda",
"Catch": [ {
"ErrorEquals": ["States.TaskFailed"],
"Next": "CatchHandler"
} ],
"End": true
},
"CatchHandler": {
"Type": "Pass",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:CATCH_HANDLER",
"End": true
}
}
},
{
"StartAt": "Fourth",
"States": {
"Fourth": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:Fourth_Lambda",
"TimeoutSeconds": 120,
"End": true
}
}
},
{
"StartAt": "Fifth",
"States": {
"Fifth": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:Fifth_Lambda",
"TimeoutSeconds": 120,
"End": true
}
}
},
{
"StartAt": "Sixth",
"States": {
"Sixth": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:Sixth_Lambda",
"TimeoutSeconds": 120,
"End": true
}
}
}
}
]
}
}
}
}
You can use the Parameters field of a state.
It lets you send a specific value or JSON object to the next Lambda.
"Parameters": {
"toprocess.$": "$.MetaData.CorrelationId"
},
The input to this Lambda is then a smaller DTO than the output of your first Lambda. When returning a value from this Lambda, avoid overwriting the whole state with it; write it to its own path instead:
"OutputPath": "$",
"ResultPath": "$.PartialResult",
What you are looking for is the Map state. You give it the path to the list to iterate over (in your case, the ids), and it runs once for each item in the list. Inside the Map state you have a full state machine, so you can call a Lambda or use any other state. Its MaxConcurrency field limits how many iterations run at once, if needed.
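If each id has to go to a different Lambda, the Parameters approach can be applied per branch of the Parallel state. The sketch below generates such branches; it assumes DB_CALL returns an object of the form {"ids": [...]}, which is not stated in the question, and the helper name and the generated state names are hypothetical.

```python
import json


def build_branches(function_arns: list, ids_path: str = "$.ids") -> list:
    """Build one Parallel branch per Lambda, each receiving only its own id."""
    branches = []
    for index, arn in enumerate(function_arns):
        name = f"Lambda_{index + 1}"
        branches.append(
            {
                "StartAt": name,
                "States": {
                    name: {
                        "Type": "Task",
                        "Resource": arn,
                        # Select a single element of the assumed "ids" array.
                        "Parameters": {"id.$": f"{ids_path}[{index}]"},
                        "End": True,
                    }
                },
            }
        )
    return branches


arns = [
    "arn:aws:lambda:us-east-1:123456789012:function:First_Lambda",
    "arn:aws:lambda:us-east-1:123456789012:function:Second_Lambda",
]
branches = build_branches(arns)
print(json.dumps(branches[1], indent=2))
```

The resulting list can be placed in the "Branches" field of the Parallel state; each Lambda then receives an input such as {"id": ...} instead of the whole list.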

Can AWS Step Function describe this kind of dataflow?

It cannot be described with the Parallel state in AWS Step Functions.
B and C should be in parallel.
C sends messages to both D and E.
D and E should be in parallel.
{
"StartAt": "A",
"States": {
"A": {
"Type": "Pass",
"Next": "Parallel State 1"
},
"Parallel State 1": {
"Type": "Parallel",
"Branches": [{
"StartAt": "B",
"States": {
"B": {
"Type": "Pass",
"End": true
}
}
},
{
"StartAt": "C",
"States": {
"C": {
"Type": "Pass",
"End": true
}
}
}
],
"Next": "Parallel State 2"
},
"Parallel State 2": {
"Type": "Parallel",
"Branches": [{
"StartAt": "D",
"States": {
"D": {
"Type": "Pass",
"End": true
}
}
},
{
"StartAt": "E",
"States": {
"E": {
"Type": "Pass",
"End": true
}
}
}
],
"Next": "F"
},
"F": {
"Type": "Pass",
"End": true
}
}
}
The answer is no: within a state machine, no state can name multiple states in its Next field (that is, invoke both successors). Likewise, a state machine cannot start with multiple states, because StartAt takes a single state name.
You can adjust your logic and use the Parallel state to achieve the same result. If you share your use case, it may be easier to help.
How to specify multiple result path values in AWS Step Functions
A Parallel state provides each branch with a copy of its own input
data (subject to modification by the InputPath field). It generates
output that is an array with one element for each branch, containing
the output from that branch.
https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism/
Example of a state machine:
{
"Comment": "An example of the Amazon States Language using a choice state.",
"StartAt": "FirstState",
"States": {
"FirstState": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
"Next": "ChoiceState"
},
"ChoiceState": {
"Type" : "Choice",
"Choices": [
{
"Variable": "$.foo",
"NumericEquals": 1,
"Next": "FirstMatchState"
},
{
"Variable": "$.foo",
"NumericEquals": 2,
"Next": "SecondMatchState"
}
],
"Default": "DefaultState"
},
"FirstMatchState": {
"Type" : "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:OnFirstMatch",
"Next": "NextState"
},
"SecondMatchState": {
"Type" : "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:OnSecondMatch",
"Next": "NextState"
},
"DefaultState": {
"Type": "Fail",
"Error": "DefaultStateError",
"Cause": "No Matches!"
},
"NextState": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
"End": true
}
}
}
https://docs.aws.amazon.com/step-functions/latest/dg/connect-to-resource.html#connect-wait-example
https://sachabarbs.wordpress.com/2018/10/30/aws-step-functions/
As I answered in "How to simplify complex parallel branch interdependencies for Step Functions", what you are asking for is better modeled as a DAG than as a state machine.
Depending on your use case, you might be able to work around it (as in @horatiu-jeflea's answer), but it remains a workaround rather than a straightforward solution.
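One way to picture the workaround is to group the DAG into stages, where each stage could become one Parallel state. The sketch below does that with a simple level-by-level topological ordering. The edge list is an assumption reconstructed from the question's description (the original diagram is not available), and the function name is hypothetical.

```python
def group_into_stages(edges: dict) -> list:
    """Group DAG nodes into stages; every node's predecessors sit in earlier stages."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)

    indegree = {node: 0 for node in nodes}
    for targets in edges.values():
        for target in targets:
            indegree[target] += 1

    current = sorted(node for node in nodes if indegree[node] == 0)
    stages = []
    while current:
        stages.append(current)
        following = []
        for node in current:
            for target in edges.get(node, []):
                indegree[target] -= 1
                if indegree[target] == 0:
                    following.append(target)
        current = sorted(following)
    return stages


# Assumed edges: A fans out to B and C, C feeds D and E, everything joins in F.
edges = {"A": ["B", "C"], "C": ["D", "E"], "B": ["F"], "D": ["F"], "E": ["F"]}
stages = group_into_stages(edges)
print(stages)
```

Running one Parallel state per stage reproduces the definition in the question, at the cost that D and E also wait for B. Nesting a Parallel state for D and E inside C's branch may model the dependency more closely, if that fits the use case.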