When using Step Functions, Lambdas can get input from a state and create output that the Step Functions workflow can use to affect flow via a Choice state. However, while Batch jobs can also get input from Step Functions, I can't find information on how to get a Batch job's output back to Step Functions so it can be fed into a Choice state (that is, JSON output, rather than simply the succeeded/failed status of the job).
I would like some advice on whether Step Functions is suitable for my use case.
I have a bunch of user records generated at random times. I need to do some pre-processing and validation before putting them into a pool. I have a stage which runs periodically (every 1 to 5 minutes) to collect records from the pool, combine them, and then publish them.
I need real-time traceability/monitoring of each record, and I need to notify the user once the record is published.
Here is a diagram to illustrate the flow.
Is Step Functions suitable for my use case? If not, is there an alternative that would help me simplify the solution? Thanks.
Yes, Step Functions is an option. Step Functions "State Machines" (SMs) add the greatest value versus other AWS serverless workflow patterns, such as event-driven or pub/sub, when the scenario involves complex branching/retry logic and observability requirements. SM logic is explicit and visual, which makes the workflow simple to reason about. For each SM execution, you can easily trace the exact path the execution took and where it failed. This added functionality is reflected in its higher cost.
In any case, you need to gather records until it's time to collect them. This batching requirement means your architecture will need more elements than just a State Machine. Here are some ideas:
(1) An SM preprocesses records one-by-one as they arrive
One option is to use State Machines to orchestrate the preprocessing and validation only. Each arriving record event kicks off an SM execution. Pre-processed records go into a queue, from which they are periodically polled and sent to be combined.
[Record EventBridge event] -> [preprocessing SM] -> [Record queue] -> [polling Lambda] -> [Combining Service]
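As a rough illustration of the glue between the first two boxes in that diagram (the environment variable name PREPROCESS_SM_ARN and the event shape are placeholders, not anything from your setup), an EventBridge-triggered Lambda might start one preprocessing SM execution per record along these lines:

# Hypothetical sketch only: start one preprocessing SM execution per arriving record event.
import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def handle_record_event(event, context):
    record = event.get("detail", {})  # EventBridge delivers the record payload under "detail"
    sfn.start_execution(
        stateMachineArn=os.environ["PREPROCESS_SM_ARN"],
        input=json.dumps(record),
    )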
(2) Preprocess and process batched records in an end-to-end State Machine
Gather records in a queue as they arrive. A Lambda periodically polls the queue and begins an SM execution on a batch of records. An SM Map state pre-processes and validates the records in parallel, then calls the combining service, all within a single execution. This setup gives you the greatest visibility, but is more complex because you have to handle cases where a single record causes the batched execution to fail.
[Records arrive] -> [Record source queue] -> [polling Lambda gets batch] -> [SM for preprocessing, collecting and combining]
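For the polling step in option (2), a minimal sketch might look like the following (assuming an SQS-triggered Lambda; BATCH_SM_ARN and the {"records": [...]} input shape are placeholders):

# Hypothetical sketch only: hand one SQS batch to a single end-to-end SM execution.
import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def handle_sqs_batch(event, context):
    records = [json.loads(message["body"]) for message in event["Records"]]
    sfn.start_execution(
        stateMachineArn=os.environ["BATCH_SM_ARN"],
        input=json.dumps({"records": records}),  # the SM's Map state iterates over "records"
    )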
Other
There are plenty of other combinations, including chaining SMs together if necessary, or avoiding SMs altogether. Which option is best for you will depend on which pain points matter most to you: observability, error handling, simplicity, cost.
We have an AWS Step Function that processes CSV files. These CSV files can contain anywhere from 1 to 4,000 records.
Now, I want to create another, inner AWS Step Function that will process these CSV records. The problem is that for each record I need to hit another API, and I want all of the records to be processed asynchronously.
For example: a CSV is received containing 2,500 records.
The outer step function calls the other step function 2,500 times (the inner step function takes a single CSV record as input), processes each record, and then stores the result in DynamoDB or some other place.
I have learnt about the callback pattern in AWS Step Functions, but in my case I would be passing 2,500 tokens, and I want the outer step function to continue only when all 2,500 records are done processing.
So my question is: is this possible using AWS Step Functions?
If you know any article or guide for me to reference then that would be great.
Thanks in advance
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
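For illustration only (the state names, the "records" input key, and the Lambda ARN are placeholders, not anything from your setup), a Map state definition along these lines runs one sub-workflow per CSV record:

# Illustrative sketch of a Map state that runs one sub-workflow per CSV record.
import json

definition = {
    "StartAt": "ProcessRecords",
    "States": {
        "ProcessRecords": {
            "Type": "Map",
            "ItemsPath": "$.records",  # JSON array of CSV records in the execution input
            "MaxConcurrency": 40,      # optional cap; omit or set 0 for no explicit limit
            "Iterator": {              # the sub-workflow applied to each record
                "StartAt": "CallApi",
                "States": {
                    "CallApi": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-record",
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}

print(json.dumps(definition, indent=2))  # paste into the console or pass to create_state_machine

The MaxConcurrency value here is just an example; the concurrency caveat quoted below still applies.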
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).
I have a use case where one of the tasks in a Step Function is a Manual Approval step.
As part of completing this step, we want to pass some inputs which will be used by subsequent tasks.
Is there a way to do it?
I have seen that JSON can be passed as output while completing the Manual Approval step. Is there a way we can read this output as input in the next step?
client.sendTaskSuccess(new SendTaskSuccessRequest()
        .withOutput("{\"key\": \"this is value\"}")
        .withTaskToken(getActivityTaskResult.getTaskToken()));
It is possible, but your question doesn't provide enough information for a specific answer. Some general tips about input/output processing:
By default, the output of a state becomes the input to the next state. You can use ResultPath to write the output of the Task to a new field without replacing the entire JSON payload that becomes the input to the next state.
If subsequent states are using InputPath or Parameters, you might be filtering the input and removing the output of the approval step. Similarly with OutputPath.
The doc on Input and Output Processing may be helpful: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-input-output-filtering.html
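As a minimal sketch (the state names and the $.approval path are placeholders, not from your setup), the approval Task could keep its output alongside the original input like this:

# Illustrative fragment only; state names and the "$.approval" path are assumptions.
approval_states = {
    "ManualApproval": {
        "Type": "Task",
        # An Activity resource, matching the GetActivityTask/SendTaskSuccess flow above.
        "Resource": "arn:aws:states:us-east-1:123456789012:activity:manual-approval",
        # The JSON passed to sendTaskSuccess (e.g. {"key": "this is value"})
        # lands under $.approval instead of replacing the whole input payload.
        "ResultPath": "$.approval",
        "Next": "UseApprovalOutput",
    },
    "UseApprovalOutput": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:next-step",
        # Optionally narrow this state's input down to just the approver's value.
        "InputPath": "$.approval",
        "End": True,
    },
}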
I am using the Step Functions Data Science SDK in Python. I have a task that runs every day, and the path of the data accessed in certain steps of the step function changes every day, as it includes a date parameter.
How can I pass the date parameter when I execute the step function and use it so that I can access new data every day automatically?
This is an example of a step I am adding to the workflow.
etl_step = steps.GlueStartJobRunStep(
    'Extract, Transform, Load',
    parameters={
        "JobName": execution_input['GlueJobName'],
        "Arguments": {
            '--S3_SOURCE': data_source,
            '--S3_DEST': 's3a://{}/{}/'.format(bucket, project_name),
            '--TRAIN_KEY': train_prefix + '/',
            '--VAL_KEY': val_prefix + '/',
        },
    },
)
I want to add the date variable to S3_DEST. If I use execution_input, the type isn't a string, so I cannot concatenate it into the path.
Edit
If the date is a datetime object, you can call strftime('%Y-%m-%d') on it to output it as a string.
Original
Step Functions executions accept input.
If you're using the SDK's start_execution, you can use the input parameter.
If you trigger the execution from a CloudWatch Events rule, you can specify a constant input from the console.
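A minimal sketch of that first approach using boto3's start_execution (the state machine ARN, bucket, and input keys are placeholders): format the date as a string, then pass it in the execution input so the workflow can build the daily path from it.

# Hedged sketch: pass today's date as execution input; ARN and key names are placeholders.
import json
from datetime import datetime

import boto3

sfn = boto3.client("stepfunctions")

run_date = datetime.utcnow().strftime("%Y-%m-%d")  # e.g. "2021-03-15"

sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:daily-etl",
    input=json.dumps({
        "GlueJobName": "my-etl-job",  # existing execution_input field
        "S3_DEST": "s3a://my-bucket/my-project/{}/".format(run_date),
    }),
)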
We are running a Dataflow job that handles multiple input streams. Some of them are high traffic and some of them rarely get messages through. We are joining all streams with a "shared" stream that contains information relevant to all elements. This is a simplified example of the pipeline:
I noticed that the job will not produce any output until both streams contain some traffic.
For example, let's suppose that Stream 1 gets a steady flow of traffic, whereas Stream 2 does not produce any messages for a period of time. During this time, the job's DAG will show elements being accumulated in the GroupByKey step, but nothing will be propagated beyond it. I can also see the Flatten PCollections step showing input elements for the left side of the graph but not the right one. This creates a problem when dealing with high-traffic and low-traffic streams in the same job, since it will cause output to be delayed for as long as it takes for Stream 2 to pick up messages.
I am not sure if the observation is correct, but I wanted to ask if this is how Flatten/GroupByKey works in general and if so, if the issue we're seeing can be avoided through an alternative way of constructing the pipeline.
(Example JobID: 2017-02-10_06_48_01-14191266875301315728)
As described in the documentation for GroupByKey, the default behavior is to wait for all data within the window to have arrived -- this is necessary to ensure the correctness of downstream results.
Depending on what you are trying to do, you may be able to use triggers to cause the aggregates to be output earlier.
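For illustration only, here is a hedged sketch of an early trigger in the Beam Python SDK (your job may well be written in Java, and the sources and window size here are placeholders):

# Hedged sketch: emit early panes so a slow stream does not hold back all output.
import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
    # Placeholders standing in for the real unbounded sources.
    stream_1 = p | "Stream1" >> beam.Create([("key", 1)])
    stream_2 = p | "Stream2" >> beam.Create([("key", 2)])

    merged = (stream_1, stream_2) | "Flatten" >> beam.Flatten()

    grouped = (
        merged
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),
            # Speculative panes every 30s of processing time, before the watermark closes the window.
            trigger=trigger.AfterWatermark(early=trigger.AfterProcessingTime(30)),
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
        )
        | "GroupByKey" >> beam.GroupByKey()
    )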
You may also be able to use the slow stream as a side input to the processing of the fast stream.
If you're still stuck, it would help if you could describe in more detail the contents of the streams and how you're trying to use them, since more detailed answers depend on the goal.