I am trying to use AWS Step Functions to trigger operations many S3 files via Lambda. To do this I am invoking a step function with an input that has a base S3 key of the file and part numbers each file (each parallel iteration would operate on a different S3 file). The input looks something like
{
"job-spec": {
"base_file_name": "some_s3_key-",
"part_array": [
"part-0000.tsv",
"part-0001.tsv",
"part-0002.tsv", ...
]
}
}
My Step function is very simple, takes that input and maps it out, however I can't seem to get both the file and the array as input to my lambda. Here is my step function definition
{
"Comment": "An example of the Amazon States Language using a map state to process elements of an array with a max concurrency of 2.",
"StartAt": "Map",
"States": {
"Map": {
"Type": "Map",
"ItemsPath": "$.job-spec",
"ResultPath": "$.part_array",
"MaxConcurrency": 2,
"Next": "Final State",
"Iterator": {
"StartAt": "My Stage",
"States": {
"My Stage": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "arn:aws:lambda:us-east-1:<>:function:some-lambda:$LATEST",
"Payload": {
"Input.$": "$.part_array"
}
},
"End": true
}
}
}
},
"Final State": {
"Type": "Pass",
"End": true
}
}
}
As written above it complains that that job-spec is not an array for the ItemsPath. If I change that to $.job-spec.array I get the array I'm looking for in my lambda but the base key is missing.
Essentially I want each python lambda to get the base file key and one entry from the array to stitch together the complete file name. I can't just put the complete file names in the array due to the limit limit of how much data I can pass around in Step Functions and that also seems like a waste of data
It looks like the Parameters value can be used for this but I can't quite get the syntax right
Was able to finally get the syntax right.
"ItemsPath": "$.job-spec.part_array",
"Parameters": {
"part_name.$": "$$.Map.Item.Value",
"base_file_name.$": "$.job-spec.base_file_name"
},
It seems that Parameters can be used to create custom inputs for each stage. The $$ is accessing the context of the stage and not the actual input. It appears that ItemsPath takes the array and puts it into a context which can be used later.
UPDATE Here is some AWS Documentation showing this being used from the comments below
Related
How can I get the results from previous Map Iterations in the next iteration when using MaxConcurrency: 1 in Amazon Step Functions?
Here's an example of the code I have
{
"StartAt": "UploadUsers",
"States": {
"UploadUsers": {
"Type": "Map",
"MaxConcurrency": 1,
"ItemsPath": "$.data.users",
"Parameters": {
"data.$": "$$.Map.Item.Value.data",
"friends.$": "$.?????? Get created users ids"
},
"Iterator": {
"StartAt": "UploadUser",
"States": {
"UploadUser": {
"End": true,
"Parameters": {
"FunctionName": "${FnUploadUser}",
"Payload": {
"data.$": "$.user_data",
"friends.$": "$.??????"
}
},
"Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
"ResultPath": "$.data. ???",
"Type": "Task"
}
}
},
"End": true,
"ResultPath": "$.data.UploadUsers",
"ResultSelector": {
"result.$": "$"
}
}
}
}
Suppose FnUploadUser is a lambda that returns the id of the created user.
And I want to get the ids of the previously created users and use that value for the next user I'm about to create.
You can't. Map State iterations don't share state. Two workarounds:
(1) Manage the shared state externally: Each Map iteration writes and reads from, say, a DynamoDB table.
(2) Refactor to a "for" loop and keep the shared state in the execution output.
Instead of using Map, insert a Choice State (after UploadUser) that checks for a "done" condition. If "done", finish, else loop back to UploadUser.
UploadUser accepts the user_data array as input. It appends its output to, say, the uploaded output array.
Each UploadUser iteration identifies the next user_data item by comparing it to the uploaded array. The iteration that processes the last item can also output done: true to signal to Choice that work is done.
The Choice State loops back to UploadUser while there are more to process (i.e. while done is not present).
There are other ways to build steps 2-3. For instance, you could add next_item and total_items keys on the output to keep track of progress. The important point is that Choice loops until an exit condition is met.
Let's say I have this state machine in AWS Step Function:
And I had started it with this input:
{
"item1": 1,
"item2": 2,
"item3": 3
}
It's clear for me that Action A is receiving the input payload. But, how can Action C access the state machine input to get the value of item3? Is it possible?
Thanks!!
Typically, the data available in Action C will be dependent on what the result/output of Action B is.
However, if you just care about the original input to the state machine execution, you can set the payload of Action C using the Context Object.
// roughly
"Action C": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"Payload.$": "$$.Execution.Input",
"FunctionName": "<action c lambda>"
},
Check out the AWS documentation for Context Object
Basically, i have the following input:
{
"name": "abc",
"choice": "choice1"
}
My dynamoDB table has the following structure:
Partition key - "name"
Complex json with choices:
{
"choices":
{
"choice1": ......,
"choice2": ......
}
}
I want to directly read from dynamodb, and get a subitem under the relevant choice:
{
"StartAt": "Read Next Message from DynamoDB",
"States": {
"Read Next Message from DynamoDB": {
"Type": "Task",
"Resource": "arn:aws:states:::dynamodb:getItem",
"Parameters": {
"TableName": "my_table",
"Key": {
"customerName": {"S.$": "$.name"}
}
},
"OutputPath": "$.Item.choices.M.choice1.M.myvalue.S",
"Next": "World"
},
"World": {
"Type": "Pass",
"End": true
}
}
}
basically i want to do something like "$.Item.choices.M.{$.choice}.M.myvalue.S", and take one of the output's keys from the input. is this possible?
I think what you're looking for is JsonPath interpolation, but that is not supported as per this thread on AWS forums.
As far as I know Step Functions allow only path reference through $, . and [] operators (Reference Path).
I don't know how much control you have on the DynamoDB table's data but I think your problem can be solved easily if your choice types are modeled in following way
{
"choices": [{
"choiceType": "choice1",
........
},
{
"choiceType": "choice2",
........
}]
}
Now you can use the map state to iterate over the choices array. Note that don't forget to pass the expected choiceType to each iteration.
First state of the map iterator can be a choice state which compares choiceType and moves to appropriate next state. So, basically your rest of the workflow is modeled as iterator of the map state in step 1.
Now, if you don't have the control over DynamoDB table, then you can process the query result in an AWS Lambda.
Update: Creating a step function from the Map State step template and running that also throws an error. This is strong evidence that the MaxConcurrency attribute together with the Parameters value is not working.
I am not able to use the MaxConcurrency attribute successfully in the step function definition.
This can be demonstrated by using the example provided in the documentation for the Map Task (new as of 18 sept 2019):
{
"StartAt": "ExampleMapState",
"States": {
"ExampleMapState": {
"Type": "Map",
"MaxConcurrency": 2,
"Parameters": {
"ContextIndex.$": "$$.Map.Item.Index",
"ContextValue.$": "$$.Map.Item.Value"
},
"Iterator": {
"StartAt": "TestPass",
"States": {
"TestPass": {
"Type": "Pass",
"End": true
}
}
},
"End": true
}
}
}
By executing the step function with the following input:
[
{
"who": "bob"
},
{
"who": "meg"
},
{
"who": "joe"
}
]
We can observe in the Execution event history that we get:
ExecutionStarted
MapStateEntered
MapStateStarted
MapIterationStarted (index 0)
MapIterationStarted (index 1)
PassStateEntered (index 0)
PassStateExited (index 0)
MapIterationSucceeded (index 0)
ExecutionFailed
The step function fails.
The ExecutionFailed step has the following output (execution id omitted):
{
"error": "States.Runtime",
"cause": "Internal Error (omitted)"
}
Trying to catch the error with a Catch step has no effect.
What am I doing wrong here? Is this a bug?
Response to a private ticket submitted to AWS this morning;
Thank you for contacting AWS Premium Support. My name is Akanksha and
I will be assisting you with this case.
I understand that you have been working with the new Map state feature
of step functions and have noticed that when we use Parameters along
with MaxConcurrency set to lower value than the number of iterations
(with only first iteration successful) it fails with ‘States.Runtime’
and looks like a bug with the functionality.
Thank you for providing the details. It helped me during
troubleshooting. In order to confirm the behavior, I used the below
state machine example with Pass:
{
"StartAt": "Map State",
"TimeoutSeconds": 3600,
"States": {
"Map State": {
"Type": "Map”,
"Parameters": {
“ContextValue.$”: "$$.Map.Item.Value"
},
"MaxConcurrency": 1,
"Iterator": {
"StartAt": "Run Task",
"States": {
"Run Task": {
"Type": "Pass",
"End": true
}
}
},
"Next": "Final State"
},
"Final State": {
"Type": "Pass",
"End": true
}
} }
I tested with multiple input lists and MaxConcurrency values and below
are my observations:
Input size list: 4 MaxConcurrency:1/2/3 - Fails and MaxConcurrency:0/4/5 or above - Works
Input size list: 3 MaxConcurrency: 1/2 - Fails and MaxConcurrency:0/3/4 or above - Works
Similarly, I performed tests by removing the parameters from state machine as well and could see that it works as expected with different
MaxConcurrency values.
I also tested the same by changing the Task type of “Pass” with “Lambda” and observed the same behavior.
Hence, I can confirm that the state machine fails when we have
parameters in the code and specify MaxConcurrency value as anything
other than zero or the number greater than or equal to the list size.
After doing some research regarding this behavior to check if this is
intended, I could not find much information regarding the same as this
is a new feature. So, I will be reaching out to the internal team with
all the details and the example state machine that you have provided.
Thank you for bringing this to our notice. I will get back to you as
soon as I have an update from the internal team. Please be assured
that I will regularly follow up with the team and work with them to
investigate further.
Meanwhile, if you have any other queries or concerns, please do let me
know.
Have a great day ahead!
I will update here when I get more information.
How can I passthrough the input to a Task state in an AWS Step Functions to the output?
After reading the Input and Output Processing page in the AWS docs, I have played with various combinations of InputPath, ResultPath and OutputPath.
State definition:
"First State": {
"Type": "Task",
"Resource": "[My Lambda ARN]",
"Next": "Second State",
"InputPath": "$.someKey",
"OutputPath": "$"
}
Input:
{
"someKey": "someValue"
}
Expected Result
I would like the output of the First State (and thus the input of Second State) to be
{
"someKey": "someValue"
}
Actual Result
[empty]
What if the input is more complicated, e.g.
{
"firstKey": "firstValue",
"secondKey": "secondValue"
}
I would like to forward all of it without worrying about (sub) paths.
In the Amazon States Language spec it is stated that:
If the value of ResultPath is null, that means that the state’s own raw output is discarded and its raw input becomes its result.
Consequently, I updated my state definition to
"First State": {
"Type": "Task",
"Resource": "[My Lambda ARN]",
"Next": "Second State",
"ResultPath": null
}
As a result, when passing the input example Task input payload will be copied to the output, even for rich objects like:
{
"firstKey": "firstValue",
"secondKey": "secondValue"
}
For those who find themselves here using CDK, the solution is to use the explicit aws_stepfunctions.JsonPath.DISCARD enum rather than None/null.
from aws_cdk import (
aws_stepfunctions,
aws_stepfunctions_tasks,
)
aws_stepfunctions_tasks.LambdaInvoke(
self,
"my_function",
lambda_function=lambda_function,
result_path=aws_stepfunctions.JsonPath.DISCARD,
)
https://docs.aws.amazon.com/cdk/api/latest/docs/#aws-cdk_aws-stepfunctions.JsonPath.html#static-discard
I was looking for a solution from passing input from one parallel state to another parallel state and the above option worked really good.
For example my step function is like this...tas1->parallel task2 -> parallel trask3 -> task4. So when it start with parallel task3, the input values are wiped out, so ptask3 is failing. With the above option, i was able to pass in same input from ptask2 to ptas3.