aws_stepfunctions_tasks.BatchSubmitJob takes a payload parameter. Is it possible to access the values from that payload within the job? The use case is that the original code specified payload={"count.$": "$.count"} and container_overrides=aws_stepfunctions_tasks.BatchContainerOverrides(command=["--count", "Ref::count"]), which forces the $.count output of the previous job to be a string. Since I need to use the count for another value that must be an integer, I would like to avoid forcing that data type onto the previous job. Is this possible?
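For context, here is a minimal sketch of the setup described above, assuming CDK v2 in Python; the construct scope, job name, and the queue/definition ARNs are placeholders, and the JsonPath.string_at call is just the idiomatic equivalent of the "count.$": "$.count" entry in the question:

import aws_cdk.aws_stepfunctions as sfn
import aws_cdk.aws_stepfunctions_tasks as tasks

# Assumed to live inside a Stack/Construct; names and ARNs are placeholders.
submit_job = tasks.BatchSubmitJob(
    self, "SubmitCountJob",
    job_name="count-job",
    job_queue_arn="arn:aws:batch:...:job-queue/my-queue",
    job_definition_arn="arn:aws:batch:...:job-definition/my-def",
    # Values in payload become AWS Batch job parameters, referenced as Ref::<key>.
    # Batch job parameters are a string-to-string map, which is why Ref::count
    # always arrives in the container as a string.
    payload=sfn.TaskInput.from_object({
        "count": sfn.JsonPath.string_at("$.count"),
    }),
    container_overrides=tasks.BatchContainerOverrides(
        command=["--count", "Ref::count"],
    ),
)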
I have a Map step in distributed mode in a state machine, which iterates over a large number of entries in S3 (>300,000 items).
The steps inside the map all succeed, but the Map step itself then fails with the error: The state/task 'process-s3-file' returned a result with a size exceeding the maximum number of bytes service limit.
I've followed the advice in this Stack Overflow question and set OutputPath to null, which means every single step now returns an empty object ({}), but for some reason that didn't help. I've also set ResultPath of the Map step to "Discard result and keep original input", but that didn't help either. I can confirm that each individual iteration inside the map now returns an empty object ({}).
Am I missing something?
I was able to fix this by enabling S3 export of results. Apparently that replaces the Map output entirely, so the output size error is no longer triggered.
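For anyone hitting the same limit, this is roughly what the exported-results configuration looks like as a ResultWriter block inside the distributed Map state's definition; the bucket name and prefix here are placeholders:

"ResultWriter": {
  "Resource": "arn:aws:states:::s3:putObject",
  "Parameters": {
    "Bucket": "my-results-bucket",
    "Prefix": "map-run-output/"
  }
}

With ResultWriter configured, the Map state returns a reference to the results manifest in S3 instead of the aggregated iteration results, which is why the size limit no longer applies.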
I have a use case where one of the tasks in a Step Function is a manual approval step.
As part of completing this step, we want to pass some inputs which will be used by subsequent tasks.
Is there a way to do it?
I have seen that you can pass JSON in the output while completing the manual approval step. Is there a way that we can read this output as input in the next step?
// Complete the activity task and pass JSON output on to the state machine
client.sendTaskSuccess(new SendTaskSuccessRequest()
        .withOutput("{\"key\": \"this is value\"}")
        .withTaskToken(getActivityTaskResult.getTaskToken()));
It is possible, but your question doesn't provide enough information for a specific answer. Some general tips about input/output processing:
By default, the output of a state becomes the input to the next state. You can use ResultPath to write the output of the Task to a new field without replacing the entire JSON payload that becomes the input to the next state.
If subsequent states are using InputPath or Parameters, you might be filtering the input and removing the output of the approval step. Similarly with OutputPath.
The doc on Input and Output Processing may be helpful: https://docs.aws.amazon.com/step-functions/latest/dg/concepts-input-output-filtering.html
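As an illustration (the state names and activity ARN below are made up), if the approval task sets ResultPath, the JSON you pass to sendTaskSuccess is nested under that field and travels on to the next state:

"ManualApproval": {
  "Type": "Task",
  "Resource": "arn:aws:states:us-east-1:123456789012:activity:manual-approval",
  "ResultPath": "$.approval",
  "Next": "NextStep"
}

With this setup, NextStep receives the original input plus an approval field containing {"key": "this is value"}, which it can select with InputPath or a Parameters entry such as "ApprovalKey.$": "$.approval.key".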
I am using the Step Functions Data Science SDK in Python. I have a task that runs every day, and the path of the data accessed in certain steps of the step function changes every day because it includes a date parameter.
How can I pass the date parameter when I execute the step function, and use it so that I can access the new data every day automatically?
This is an example of a step I am adding to the workflow:
etl_step = steps.GlueStartJobRunStep(
    'Extract, Transform, Load',
    parameters={
        "JobName": execution_input['GlueJobName'],
        "Arguments": {
            '--S3_SOURCE': data_source,
            '--S3_DEST': 's3a://{}/{}/'.format(bucket, project_name),
            '--TRAIN_KEY': train_prefix + '/',
            '--VAL_KEY': val_prefix + '/'
        }
    }
)
I want to add the date variable to S3_DEST. If I use execution_input, the value isn't a string, so I cannot concatenate it into the path.
Edit
If the date is a datetime object, you can use its strftime('%Y-%m-%d') method to output it as a string.
Original
Step Functions supports passing input into an execution.
If you're using the SDK's start_execution, you can use the input parameter.
If you trigger it from a CloudWatch Events rule, you can specify a constant input from the console.
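For example, a minimal sketch using boto3 (the state machine ARN, key names, and job name are placeholders); the simplest way around the concatenation problem is to format the date-dependent path up front and pass it in whole:

import json
from datetime import datetime

import boto3

# Build the date-dependent values as plain strings before starting the execution;
# bucket and project_name are the same variables used in the question above.
run_date = datetime.now().strftime('%Y-%m-%d')
s3_dest = 's3a://{}/{}/{}/'.format(bucket, project_name, run_date)

sfn = boto3.client('stepfunctions')
sfn.start_execution(
    stateMachineArn='arn:aws:states:...:stateMachine:daily-etl',  # placeholder ARN
    input=json.dumps({
        'GlueJobName': 'my-glue-job',  # placeholder job name
        'S3_DEST': s3_dest,
    }),
)

In the workflow definition, '--S3_DEST': execution_input['S3_DEST'] then picks up the pre-formatted path without any string concatenation on the placeholder.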
I have a transformation with several steps that runs via a batch script using Windows Task Scheduler.
Sometimes the first step, or the nth step, fails and it stops the entire transformation.
I want the transformation to run from start to end regardless of any errors. Is there any way of doing this?
1) One way is to use error handling; however, it is not available for all steps. You can right-click on a step and check whether the error handling option is available or not.
2) If you are getting errors because of an incorrect data type, for example you expect an integer value but for some specific record you get a string value and it fails, you can use the Data Validator step to handle that situation.
Basically, you can implement the logic based on the transformation you have created. The above are some of the general methods.
This is what is called "Error Handling": even though your transformation runs with some errors, you still want it to continue running.
Situations:
- Data type issues in the data stream.
Ex: say you have a column X of data type integer but by mistake you get a string value; you can define error handling to capture all such records.
- While processing JSON data.
Ex: the path you specified to retrieve the value of a JSON field cannot be resolved for some data node, or is missing; you can define error handling to capture all missing-path details.
- While updating a table.
Ex: if you are updating a table with some key and that key coming from the input stream is not available, an error will occur; you can define error handling here as well.
I'm trying to create a transformation that reads CSV files and checks the data type of each field in the CSV.
Like this: the standard field A should be a string(1) (one character) and field B should be an integer/number.
What I want is to check/validate: if A is not string(1), set Status = Not Valid, and likewise if B is not an integer/number. Then every file with status Not Valid should be moved to an error folder.
I know I can use the Data Validator to do this, but how do I move the file based on that status? I can't find any step to do it.
You can read the files in a loop and add steps as below:
after the data validation, filter the rows with a negative result (not matched) -> add an Add constants step with error = 1 -> add a Set Variables step for the error field (with a default value of 0).
After the transformation finishes, you can add a Simple Evaluation step in the parent job to check the value of the ERROR variable.
If it has value 1, then move the files; else ....
I hope this can help.
You can do the same as in this question. Once the files are read, use a Group by to get one flag per file. However, this time you cannot do it in one transformation; you should use a job.
Your use case is in the samples shipped with your PDI distribution. The sample is in the folder your-PDI/samples/jobs/run_all. Open Run all sample transformations.kjb and replace the Filter 2 of Get Files - Get all transformations.ktr with your logic, which includes a Group by so that you have one status per file and not one status per row.
In case you wonder why you need such complex logic for such a task, remember that PDI starts all the steps of a transformation at the same time. That is its great power, but it means you do not know whether you have to move the file until every row has been processed.
Alternatively, you have the quick-and-dirty solution from your similar question: change the Filter rows to a type check, and the final Synchronize after merge to a Process files / Move.
And a final piece of advice: instead of checking the type with a Data Validator, which is a good solution in itself, you may use a JavaScript step like the one there. It is more flexible if you need to maintain it in the long run.