I have submitted 3 jobs in parallel in AWS Batch and I want to create a trigger that fires when all 3 jobs are completed.
Something like: I should be able to specify the 3 job IDs and update the DB once all 3 jobs are done.
I could do this easily with long polling, but I wanted to do something event-based.
I need your help with this.
The easiest option would be to create a fourth Batch job that specifies the other three jobs as dependencies. This job will sit in the PENDING state until the other three jobs have succeeded, and then it will run. Inside that job, you could update the DB or do whatever other actions you wanted.
One downside to this approach is that if one of the jobs fails, the pending job will automatically go into a FAILED state without running.
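A minimal boto3 sketch of that approach (the queue name, job definition, and job IDs are placeholders):

import boto3

batch = boto3.client("batch")

# IDs of the three jobs already submitted (placeholders)
job_ids = ["job-id-1", "job-id-2", "job-id-3"]

# Submit a fourth job that only runs after all three succeed.
# Its container would hold the "update the DB" logic.
response = batch.submit_job(
    jobName="update-db-after-batch",
    jobQueue="my-job-queue",            # assumed queue name
    jobDefinition="update-db-job-def",  # assumed job definition
    dependsOn=[{"jobId": job_id} for job_id in job_ids],
)
print("Submitted finalizer job:", response["jobId"])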
Related
I have two AWS Glue workflows, WorkflowA and WorkflowB, in which I want to run WorkflowA based on a schedule, but WorkflowB should run only after successful completion of WorkflowA.
I attempted to create a trigger called startWorkflowB that is event-based on the successful completion of WorkflowA's last task.
Now, when I try to use the startWorkflowB trigger as the first task in WorkflowB, I get the following error -
My end result should be that WorkflowB should run only after successful completion of WorkflowA.
What am I doing wrong here? Is it feasible to have a linear dependency between two workflows among multiple Glue workflows?
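For reference, roughly what I attempted when creating startWorkflowB, as a boto3 sketch (job and workflow names are placeholders):

import boto3

glue = boto3.client("glue")

# The conditional trigger I tried to create: fire when the last job of
# WorkflowA succeeds and start the first job of WorkflowB.
glue.create_trigger(
    Name="startWorkflowB",
    WorkflowName="WorkflowB",          # trigger registered as part of WorkflowB
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "workflowA-last-job",   # placeholder job name
                "State": "SUCCEEDED",
            }
        ],
    },
    Actions=[{"JobName": "workflowB-first-job"}],  # placeholder job name
)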
I have a flow that, once a day:
(cron at hour H): Creates an AWS CloudWatch Logs export task
(cron at hour H+2): Consumes the logs exported in step 1
Things were kept simple by design:
The two steps are separate scripts that don't relate to each other.
Step 2 doesn't have the task ID created in step 1.
Step 2 is not aware of step 1 and doesn't know if the logs export task is finished.
I could add a mechanism by which the first script publishes the task ID somewhere and the second script consumes that task ID and queries CloudWatch to check if the task is finished and only proceeds when it is.
However, I'd prefer to keep it where there's no such handoff from step 1 to step 2.
What I'd like to do is when the log export is done, step 2 automatically starts.
👉 Is there an event "CloudWatch Export Task finished" that can be used to trigger the start of step 2?
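For completeness, the handoff mechanism I mentioned above would look roughly like this (the task ID would have to be published by step 1 somehow; here it is just passed in):

import time
import boto3

logs = boto3.client("logs")

def wait_for_export(task_id, poll_seconds=30):
    # Poll the export task until CloudWatch Logs reports it finished.
    while True:
        resp = logs.describe_export_tasks(taskId=task_id)
        status = resp["exportTasks"][0]["status"]["code"]
        if status == "COMPLETED":
            return
        if status in ("CANCELLED", "FAILED"):
            raise RuntimeError(f"Export task {task_id} ended with status {status}")
        time.sleep(poll_seconds)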
I have to implement functionality that requires delayed sending of a message to a user, once, on a specific date, which can be anytime from tomorrow to a few months from now.
All our code is so far implemented as lambda functions.
I'm considering three options on how to implement this:
Create an entry in DynamoDB with the hash key being the date and the range key being a unique ID. Schedule a Lambda to run once a day and pick up all entries/tasks scheduled for that day, then send a message for each of them.
Using the SDK, create a CloudWatch Events rule with a cron expression indicating a single execution and make it invoke a Lambda function (target) with the ID of the user/message. The Lambda would be invoked on that specific schedule with the specific user/message to be delivered.
Create a Step Functions execution and configure it to sleep, then invoke a step with the logic to send the message when the right moment comes.
Do you have any recommendations on what would be best practice for implementing this kind of business requirement? Or perhaps an entirely different approach?
It largely depends on scale. If you'll only have a few scheduled at any point in time then I'd use the CloudWatch events approach. It's very low overhead and doesn't involve running code and doing nothing.
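A minimal boto3 sketch of that approach, assuming the target Lambda already allows events.amazonaws.com to invoke it (names, date, and ARNs are placeholders):

import json
import boto3

events = boto3.client("events")

# One-shot rule: a cron expression with an explicit year only ever matches once.
# Fires 2024-06-15 at 12:00 UTC (placeholder date).
events.put_rule(
    Name="send-message-user-123",
    ScheduleExpression="cron(0 12 15 6 ? 2024)",
)
events.put_targets(
    Rule="send-message-user-123",
    Targets=[{
        "Id": "send-message-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:send-message",  # placeholder
        "Input": json.dumps({"userId": "123", "messageId": "abc"}),
    }],
)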
If you expect a LOT of schedules then the DynamoDB approach is very possibly the best approach. Run the lambda on a fixed schedule, see what records have not yet been run, and are past/equal to current time. In this model you'll want to delete the records that you've already processed (or mark them in some way) so that you don't process them again. Don't rely on the schedule running at certain intervals and checking for records between the last time and the current time unless you are recording when the last time was (i.e. don't assume you ran a minute ago because you scheduled it to run every minute).
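A rough sketch of that scheduled Lambda, assuming a table named scheduled_messages with a send_at ISO-timestamp attribute and a send_message helper (all made-up names):

import datetime
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("scheduled_messages")  # assumed table name

def handler(event, context):
    now = datetime.datetime.utcnow().isoformat()
    # Find every entry whose send time has passed.
    resp = table.scan(FilterExpression=Attr("send_at").lte(now))
    for item in resp["Items"]:
        send_message(item)  # placeholder for your delivery logic
        # Delete (or mark) the record so it is not processed again.
        table.delete_item(Key={"date": item["date"], "id": item["id"]})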
Step functions could work if the time isn't too far out. You can include a delay in the step that causes it to just sit and wait. The delays in step functions are just that, delays, not scheduled times, so you'd have to figure out that delay yourself, and hope it fires close enough to the time you expect it. This one isn't a bad option for mid to low volume.
Edit:
Step Functions now includes a wait-until option on Wait states (Timestamp / TimestampPath), rather than only a fixed delay. This is a really good option for what you are describing.
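A sketch of how that could be wired up with boto3 (the Lambda ARN and role are placeholders); the Wait state reads the delivery time from the execution input via TimestampPath:

import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "WaitUntilSendTime",
    "States": {
        "WaitUntilSendTime": {
            "Type": "Wait",
            "TimestampPath": "$.sendAt",   # e.g. "2024-06-15T12:00:00Z" in the execution input
            "Next": "SendMessage",
        },
        "SendMessage": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:send-message",  # placeholder
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="delayed-message",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-role",  # placeholder
)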
As of November 2022, the cleanest approach would be to use EventBridge Scheduler's one-time schedule.
A one-time schedule will invoke a target only once at the date and time that you specify using a valid date, and a timestamp. EventBridge Scheduler supports scheduling in Universal Coordinated Time (UTC), or in the time zone that you specify when you create your schedule. You configure a one-time schedule using an at expression.
Here is an example using the AWS CLI:
aws scheduler create-schedule --schedule-expression "at(2022-11-30T13:00:00)" --name schedule-name \
--target '{"RoleArn": "role-arn", "Arn": "QUEUE_ARN", "Input": "TEST_PAYLOAD" }' \
--schedule-expression-timezone "America/Los_Angeles" \
--flexible-time-window '{ "Mode": "OFF"}'
Reference: Schedule types on EventBridge Scheduler - EventBridge Scheduler User Guide
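The boto3 equivalent looks roughly like this (role, target ARN, and payload are the same placeholders as in the CLI example):

import boto3

scheduler = boto3.client("scheduler")

scheduler.create_schedule(
    Name="schedule-name",
    ScheduleExpression="at(2022-11-30T13:00:00)",
    ScheduleExpressionTimezone="America/Los_Angeles",
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "RoleArn": "role-arn",   # role EventBridge Scheduler assumes to invoke the target
        "Arn": "QUEUE_ARN",      # e.g. an SQS queue or Lambda ARN
        "Input": "TEST_PAYLOAD",
    },
)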
Instead of using DynamoDB, I would suggest using S3: store the message and the time to trigger as key/value pairs.
Use an S3 Lambda trigger to create the CloudWatch Events rules that target the specific Lambdas, etc.
You could even schedule a cron-based Lambda that reads the files from S3 and creates/updates the required rule for each message to be sent.
Hope this is in line with your requirements.
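A rough sketch of that S3-triggered Lambda, assuming the object key encodes the send time and user (the key layout, rule name, and target Lambda ARN are all made up):

import datetime
import json
import boto3

events = boto3.client("events")
TARGET_ARN = "arn:aws:lambda:us-east-1:123456789012:function:send-message"  # placeholder

def handler(event, context):
    # Invoked by the S3 PUT trigger on the schedule bucket.
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        # Assumed key layout: schedules/20240615T1200_user123
        stamp, user_id = key.rsplit("/", 1)[-1].split("_", 1)
        when = datetime.datetime.strptime(stamp, "%Y%m%dT%H%M")
        rule = f"send-{user_id}-{stamp}"
        # Same one-shot put_rule/put_targets pattern as in the earlier answer.
        events.put_rule(
            Name=rule,
            ScheduleExpression=f"cron({when.minute} {when.hour} {when.day} {when.month} ? {when.year})",
        )
        events.put_targets(
            Rule=rule,
            Targets=[{"Id": "send-message", "Arn": TARGET_ARN, "Input": json.dumps({"userId": user_id})}],
        )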
I've got an SQS queue that is filled with a JSON message whenever my S3 bucket has a CREATE event.
The message contains the bucket and object name.
I also have a Docker image containing a Python script that reads the message from SQS, uses it to download the corresponding object from S3, then reads the object and puts some values in DynamoDB.
1. When submitting this as a single job to AWS Batch, I can achieve the above use case, but it's time-consuming because I have 80k objects with an average object size of 300 MB.
2. When submitting it as a multi-node parallel job, the job gets stuck in the RUNNING state and the master node goes to a FAILED state.
Note: the object type is MF4 (measurement file) from a vehicle logger, so each object needs to be downloaded locally to be read with asammdf.
Question 1: How do I use AWS Batch multi-node parallel jobs?
Question 2: Can I try any other services to achieve parallelism?
Answers with examples would be very helpful.
Thanks😊
I think you're looking for AWS Batch Array Jobs, not MNP Jobs. MNP jobs are for spreading one job across multiple hosts (MPI or NCCL).
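A minimal boto3 sketch of submitting an array job over the ~80k objects (queue and job-definition names are placeholders); each child job gets its index in the AWS_BATCH_JOB_ARRAY_INDEX environment variable, which your container script can use to pick its slice of work:

import boto3

batch = boto3.client("batch")

# One parent job fans out into 10,000 child jobs (10,000 is the array size limit),
# so each child would handle a chunk of the ~80k objects.
batch.submit_job(
    jobName="process-mf4-objects",
    jobQueue="my-job-queue",        # placeholder
    jobDefinition="mf4-processor",  # placeholder
    arrayProperties={"size": 10000},
)

# Inside the container, the script reads its index to decide which objects to process:
#   index = int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])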
We have 5 pipes in our data pipeline, which execute on the following basis:
pipe 1 - pipe 4 = daily basis
pipe 5 - end of the month.
We are considering creating a separate pipeline for pipe 5, as it doesn't have any dependency on the other pipes.
Is there any way I can execute all pipes except pipe 5, using something like a decision variable as we have in Oozie, so that the execution of pipe 5 is skipped and the pipeline completes without any "error"/"Waiting on dependencies" status?
You're probably better off creating multiple pipelines and setting them on different schedules. If you'd like to spice things up, you can use CloudWatch scheduling and AWS Lambda to schedule pipeline creation/deletion in a cron-like way. You could also use AWS Step Functions to define the workflow of each component.
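For instance, a Lambda on a CloudWatch cron could activate a pre-defined month-end pipeline - a simpler variant of the creation/deletion idea (the pipeline ID is a placeholder):

import boto3

datapipeline = boto3.client("datapipeline")

def handler(event, context):
    # Triggered by a CloudWatch Events cron rule, e.g. cron(0 6 L * ? *) for the last day of the month.
    datapipeline.activate_pipeline(pipelineId="df-EXAMPLE1234567")  # placeholder pipeline ID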