Can we pass a dynamic variable to AWS Step Functions on execution? - amazon-web-services

I am using the Step Functions Data Science SDK for Python. I have a task that runs every day, and the path of the data accessed in certain steps of the step function changes every day because it contains a date component.
How can I pass the date parameter when I execute the step function, and use it so that I can access the new data every day automatically?
This is an example of a step I am adding to the workflow.
etl_step = steps.GlueStartJobRunStep(
    'Extract, Transform, Load',
    parameters={
        "JobName": execution_input['GlueJobName'],
        "Arguments": {
            '--S3_SOURCE': data_source,
            '--S3_DEST': 's3a://{}/{}/'.format(bucket, project_name),
            '--TRAIN_KEY': train_prefix + '/',
            '--VAL_KEY': val_prefix + '/'
        }
    }
)
I want to add the date variable to S3_DEST. If I use execution_input, the type isn't a string, so I cannot concatenate it into the path.

Edit
If the date is a datetime object, you can use datetime.strftime('%Y-%m-%d') to output it as a string.
Original
Step Functions supports passing input to an execution.
If you're using the SDK to call start_execution, you can use the input parameter.
If you trigger the execution from a CloudWatch Events rule, you can specify a constant input from the console.
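A minimal sketch combining the two points above, assuming the Data Science SDK's ExecutionInput plus boto3; the schema key S3DestPath, the bucket/project/job names, and the state machine ARN are placeholders, not the asker's actual values. The idea is to build the dated path as an ordinary Python string before starting the execution, pass it in as input, and reference it inside the step through the placeholder:

from datetime import datetime
import json

import boto3
from stepfunctions import steps
from stepfunctions.inputs import ExecutionInput

# Declare the execution input schema; 'S3DestPath' is an assumed key name.
execution_input = ExecutionInput(schema={
    'GlueJobName': str,
    'S3DestPath': str   # the full, already-formatted destination path
})

etl_step = steps.GlueStartJobRunStep(
    'Extract, Transform, Load',
    parameters={
        "JobName": execution_input['GlueJobName'],
        "Arguments": {
            # No concatenation inside the workflow definition: the whole
            # path arrives as a single string at execution time.
            '--S3_DEST': execution_input['S3DestPath']
        }
    }
)

# At execution time, do the string work in plain Python and pass the result in.
today = datetime.now().strftime('%Y-%m-%d')
dest = 's3a://{}/{}/{}/'.format('my-bucket', 'my-project', today)   # placeholder names

boto3.client('stepfunctions').start_execution(
    stateMachineArn='arn:aws:states:us-east-1:123456789012:stateMachine:etl',  # placeholder ARN
    input=json.dumps({'GlueJobName': 'my-glue-job', 'S3DestPath': dest})
)

If the workflow object created with the Data Science SDK is at hand, workflow.execute(inputs={...}) accepts the same dictionary directly.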

Related

How to read batch submit job payload in job?

aws_stepfunctions_tasks.BatchSubmitJob takes a payload parameter - is it possible to access the values from that payload within the job? The use case is that the original code specified payload={"count.$": "$.count"} and container_overrides=aws_stepfunctions_tasks.BatchContainerOverrides(command=["--count", "Ref::count"]), which forces the $.count output of the previous job to be a string. Since I need to use the count for another value that must be an integer, I would like to avoid forcing that data type onto the previous job. Is this possible?
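For context, a hedged sketch of the setup being described, assuming the CDK v2 Python API; the ARNs, job name, and construct IDs are placeholders rather than the original project's values:

import aws_cdk as cdk
import aws_cdk.aws_stepfunctions as sfn
import aws_cdk.aws_stepfunctions_tasks as tasks

class CountBatchStack(cdk.Stack):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        tasks.BatchSubmitJob(
            self, 'SubmitCountJob',
            job_name='count-job',  # placeholder
            job_definition_arn='arn:aws:batch:us-east-1:123456789012:job-definition/count:1',  # placeholder
            job_queue_arn='arn:aws:batch:us-east-1:123456789012:job-queue/default',  # placeholder
            # Values in payload become Batch job parameters and are referenced in the
            # container command as Ref::<name>, which is why they arrive as strings.
            payload=sfn.TaskInput.from_object({'count.$': '$.count'}),
            container_overrides=tasks.BatchContainerOverrides(
                command=['--count', 'Ref::count']
            ),
        )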

AWS Batch output to Step Functions

When using Step Functions, Lambdas can get input from a state and create output that can be used by the Step Functions to affect flow using a Choice state. However, while Batch jobs can also get input from Step Functions, I can't find information on how to get batch output back to Step Functions to be fed into a Choice state (as in JSON output, rather than simply the succeeded/failed state of the job).

How to use Apache Beam to process historic time series data?

I have an Apache Beam pipeline that processes multiple time series in real time. Deployed on GCP Dataflow, it combines the time series into windows and calculates the aggregates, etc.
I now need to perform the same operations over historic data (the same (multiple) time series) stretching all the way back to 2017. How can I achieve this using Apache Beam?
I understand that I need to use Beam's windowing to calculate the aggregates etc., but it should accept data from two years back onwards.
Effectively, I need the results as they would have been available had I deployed the same pipeline two years ago. This is needed for testing/model-training purposes.
That sounds like a perfect use case for Beam's focus on event-time processing. You can run the pipeline against any legacy data and get correct results as long as the events have timestamps. Without additional context, I think you will need an explicit step in your pipeline that assigns custom timestamps (from 2017), which you will need to extract from the data itself. To do this you can probably use either:
context.outputWithTimestamp() in your DoFn;
the WithTimestamps PTransform.
You might need to configure the allowed timestamp skew if you run into timestamp-ordering issues. (A Python-SDK sketch of the timestamp-assignment step follows the links below.)
See:
outputWithTimestamp example: https://github.com/apache/beam/blob/efcb20abd98da3b88579e0ace920c1c798fc959e/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/windowing/WindowingTest.java#L248
documentation for WithTimestamps: https://beam.apache.org/releases/javadoc/2.13.0/org/apache/beam/sdk/transforms/WithTimestamps.html#of-org.apache.beam.sdk.transforms.SerializableFunction-
similar question: Assigning to GenericRecord the timestamp from inner object
another question that may have helpful details: reading files and folders in order with apache beam
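A rough Python-SDK sketch of the timestamp-assignment step (the links above show the Java side); the input path and the event_time field, assumed to hold Unix-epoch seconds, are illustrative assumptions:

import json

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

def add_event_timestamp(record):
    # Re-stamp each element with the event time carried in the data itself,
    # so windowing runs in event time even for records from 2017.
    return TimestampedValue(record, record['event_time'])  # assumed Unix seconds

with beam.Pipeline() as p:
    (p
     | 'ReadHistoric' >> beam.io.ReadFromText('gs://my-bucket/historic/*.json')  # placeholder path
     | 'Parse' >> beam.Map(json.loads)
     | 'AssignEventTime' >> beam.Map(add_event_timestamp)
     | 'Window' >> beam.WindowInto(FixedWindows(60))
     # ... followed by the same aggregation transforms as the real-time pipeline ...
    )

The Java outputWithTimestamp() route linked above does the same thing from inside a DoFn.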

PDI - Check data types of field

I'm trying to create a transformation that reads CSV files and checks the data type of each field in the CSV.
For example: field A should be a string(1) (one character) and field B should be an integer/number.
What I want to check/validate: if A is not a string(1), set Status = Not Valid, and likewise if B is not an integer/number. Then every file with status Not Valid should be moved to an error folder.
I know I can use the Data Validator to do this, but how do I move a file based on that status? I can't find any step to do it.
You can read the files in a loop and add steps as below.
After data validation, filter the rows with a negative result (not matched) -> add an Add constant values step with error = 1 -> add a Set variables step for the error field with a default value of 0.
After the transformation finishes, add a Simple evaluation step in the parent job to check the value of the ERROR variable.
If it has the value 1, then move the files; else ....
I hope this can help.
You can do the same as in this question. Once the files are read, use a Group by to get one flag per file. However, this time you cannot do it in a single transformation; you should use a job.
Your use case is covered in the samples shipped with your PDI distribution. The sample is in the folder your-PDI/samples/jobs/run_all. Open Run all sample transformations.kjb and replace Filter 2 of Get Files - Get all transformations.ktr with your logic, which includes a Group by so that you end up with one status per file rather than one status per row.
In case you wonder why you need such complex logic for such a task, remember that PDI starts all the steps of a transformation at the same time. That is its great power, but it also means you cannot move a file until you know every row has been processed.
Alternatively, there is the quick and dirty solution from your similar question: change the Filter rows step to a type check, and the final Synchronize after merge to a Process files/Move.
And a final piece of advice: instead of checking the type with a Data validator, which is a good solution in itself, you may use a JavaScript step as shown there. It is more flexible if you need to maintain it over the long run.
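Outside of PDI, the per-file logic both answers describe might look like this plain-Python sketch, just to make the intended behaviour concrete; the column names A and B, the incoming/error folder names, and the exact validation rules are assumptions:

import csv
import shutil
from pathlib import Path

def file_is_valid(path):
    # One status per file: the whole file is invalid as soon as one row fails.
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            if len(row['A']) != 1:      # A must be a single character (string(1))
                return False
            try:
                int(row['B'])           # B must be an integer/number
            except ValueError:
                return False
    return True

for csv_file in Path('incoming').glob('*.csv'):
    if not file_is_valid(csv_file):
        shutil.move(str(csv_file), str(Path('error') / csv_file.name))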

Is it possible to use one parameter in another in AWS Data Pipeline?

Current setup:
There's a master data source that contains attendance records per day for students in a given school. Imagine the data is structured in a CSV format like so:
name|day|in_attendance
jack|01/01/2018|0
and so on and so forth, throughout the entire year. Now, the way we grab attendance information for a specific period in time is to specify the year and month via parameters we hand to an AWS Data Pipeline step, like so:
myAttendanceLookupStep: PYTHON=python34,s3://school_attendance_lookup.py,01,2018
That step runs the Python file defined, and 01 and 2018 specify the month and year we're looking up. However, I want to change it so that it looks more like this:
myAttendanceLookupStep: PYTHON=python34,s3://school_attendance_lookup.py,%myYear,%myMonth
myYear: 2018
myMonth: 01
Is there any way to achieve this kind of behavior in AWS Data Pipeline?
It turns out that the syntax I was using in the example wasn't far from the proper syntax. You can use supplied parameters in any portion of the pipeline (activities, etc.) - you put #{myParameterName} in place of where the parameter's value would go.
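Applied to the step from the question, with the same parameter names, that looks like:

myAttendanceLookupStep: PYTHON=python34,s3://school_attendance_lookup.py,#{myYear},#{myMonth}
myYear: 2018
myMonth: 01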
This doesn't appear to be covered in the AWS Data Pipeline documentation.