When running GitHub Actions with a concurrency restriction, can I get workflow runs enqueued rather than cancelled?

The GitHub Actions documentation says:
You can use jobs.<job_id>.concurrency to ensure that only a single job or workflow using the same concurrency group will run at a time.
...
When a concurrent job or workflow is queued, if another job or workflow using the same concurrency group in the repository is in progress, the queued job or workflow will be pending. Any previously pending job or workflow in the concurrency group will be canceled.
It is annoying that previously pending jobs get cancelled. Evidently the orchestration logic can only maintain a tiny "queue" of one (1) pending job.
I would like to be able to have multiple jobs enqueued. That is, if I trigger 5 jobs in rapid succession and they all belong to the same concurrency group, then the first one starts to run immediately (when a runner is available) and the next 4 get enqueued and wait for their turn to run, one at a time.
Is there any way to achieve this? Or will I need to request this as a feature from GitHub?

Related

How to limit number of concurrent workflows running?

The title is pretty much the question. Is there some way to limit the number of concurrent workflows running at any given time?
Some background:
I'm using eventarc to dispatch a workflow once a message has been sent to a pubsub topic. The workflow will be used to start some long-running operation (LRO) but for reasons I won't go into, I don't want more than 3 instances of this workflow running at a given time.
Is there some way to do this? - primarily from some type of configuration rather than using another compute resource.
There is no configuration to limit running processes that specifically targets sessions that are executed by a Workflow enabled for concurrent execution.
The existing process limit applies to all sessions without differentiating between those from non-concurrent or concurrent enabled Workflows.
Synchronization enables users to limit the parallel execution of certain workflows or templates within a workflow without having to restrict others.
Users can create multiple synchronization configurations in the ConfigMap that can be referred to from a workflow or template within a workflow. Alternatively, users can configure a mutex to prevent concurrent execution of templates or workflows using the same mutex.
Refer to this link for more information.
Summarizing your requirements:
Trigger workflow executions with Pub/Sub messages
Execute at most 3 workflow executions concurrently
Queue up waiting Pub/Sub messages
(Unspecified) Do you need messages processed in the order delivered?
There is no out-of-the-box capability to achieve this. For fun, below is a solution that doesn't need secondary compute (and therefore is still fully managed).
The key to making this work is likely starting new executions for every message, but waiting in that execution if needed. Workflows does not provide a global concurrency construct, so you'll need to use some external storage, such as Firestore. An algorithm like this could work:
Create a callback
Push the callback into a FIFO queue
Atomically increment a counter (which returns the new value)
If the returned value is <= 3, pop the last callback and call it
Wait on the callback
-- MAIN WORKFLOW HERE --
Atomically decrement the counter
If the returned value is < 3, pop the last callback and call it
To keep things cleaner, you could put the above steps in the triggered workflow and the main logic in a separate workflow that is called as needed.
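For illustration, here is a minimal in-process Python sketch of the counter-plus-queue idea above. It is not Workflows syntax: threading.Event stands in for a callback endpoint, a lock-protected integer and deque stand in for the Firestore counter and FIFO queue, and the names ExecutionGate, enter, leave and run_demo are hypothetical. Because the counter in this sketch includes waiting executions, the release check after the decrement is written as ">= limit".

```python
import threading
import time
from collections import deque


class ExecutionGate:
    """Lets at most `limit` callers run at once; the rest wait in FIFO order."""

    def __init__(self, limit=3):
        self.limit = limit
        self._lock = threading.Lock()   # makes each step below atomic
        self._count = 0                 # running + waiting executions
        self._waiting = deque()         # FIFO queue of "callbacks"

    def enter(self):
        callback = threading.Event()    # stand-in for a callback endpoint
        with self._lock:
            self._waiting.append(callback)
            self._count += 1
            if self._count <= self.limit:
                # A slot is free, so release the oldest waiter (here: ourselves).
                self._waiting.popleft().set()
        callback.wait()                 # "wait on the callback"

    def leave(self):
        with self._lock:
            self._count -= 1
            if self._count >= self.limit:
                # Someone is still queued: hand the freed slot to the oldest waiter.
                self._waiting.popleft().set()


def run_demo(n_jobs=6, limit=3):
    """Run n_jobs threads through the gate; return (peak concurrency, jobs done)."""
    gate = ExecutionGate(limit)
    state = {"running": 0, "peak": 0, "done": 0}
    state_lock = threading.Lock()

    def job():
        gate.enter()
        try:
            with state_lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            time.sleep(0.01)            # stand-in for the main workflow
            with state_lock:
                state["running"] -= 1
                state["done"] += 1
        finally:
            gate.leave()

    threads = [threading.Thread(target=job) for _ in range(n_jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"], state["done"]
```

In a real deployment the push, increment and pop would be separate calls to external storage, so they would need a transaction (or similar) to get the atomicity that the lock provides here.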

Is there a way to finish manual task synchronously (without waiting for async result) if some precondition is satisfied?

I am using AWS SWF and flow framework. I wanted to make my activities idempotent so that a workflow can be restarted from the beginning after any failure. Many of the activities are manual tasks (#ManualActivityCompletion) which need to be completed asynchronously.
Is there a way to finish manual tasks like normal tasks if I know that they are already complete? This way a new manual task will not be scheduled every time the workflow is retried.
Or, is there a way to retry a workflow so that it starts from the point it failed?
Currently there is no way to override activity completion behavior at runtime. The workaround is to complete the activity using ManualActivityCompletionClient from within the activity implementation.
There is no supported way to retry a workflow so that it starts from the point of failure.
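A rough sketch of that workaround, written in Python against a boto3-style SWF client rather than the Java Flow framework the question uses. respond_activity_task_completed is the boto3 call for the RespondActivityTaskCompleted API; the function name finish_or_defer, the is_already_done check and the way the task token reaches this code are assumptions made for illustration.

```python
def finish_or_defer(swf_client, task_token, is_already_done, result=""):
    """Complete a manual task immediately when its work is already done.

    Returns True if the task was completed here, False if it is left open
    for the usual asynchronous (manual) completion.
    """
    if is_already_done():
        # Same API call that a later manual completion would make.
        swf_client.respond_activity_task_completed(
            taskToken=task_token, result=result
        )
        return True
    return False
```

Called at the start of the activity implementation, this would let a retried workflow pass straight through tasks that were finished in an earlier attempt, provided is_already_done can tell reliably.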

Approach to crashed workers in amazon swf

We're currently implementing a workflow in Amazon SWF where we submit jobs/workflow executions from our web application. Everything was fairly quick and painless to get set up using the Ruby Flow framework. As long as the deciders/activity workers don't crash we seem to be able to handle most issues/exceptions gracefully.
My question is, what is common practice for the scenario where the decider process crashes midway through a workflow execution? If the task fails in that way, is it possible to push an SNS notification (I've seen no examples) or something to indicate to another process that there's been an unexpected failure/crash?
There are various types of "decider" failures.
Workflow worker crashes while processing a decision. The decision task is automatically rescheduled after the specified timeout. Make sure that the workflow type's defaultTaskStartToCloseTimeout is not set too high. If the crash is not related to code correctness, the rescheduled task is processed and the workflow execution continues normally.
Workflow worker doesn't crash but the workflow execution itself fails. In this case you can use ListClosedWorkflowExecutions to count such failed workflows.
Workflow worker doesn't crash but a decision task cannot complete because RespondDecisionTaskCompleted fails due to a bug in the Flow framework. Since from SWF's point of view the task is never completed, at some point it is marked as timed out and rescheduled. As the bug is still present, the new task again never completes and is rescheduled, and so on. A workflow execution experiencing this issue has a history whose tail consists of repeated "decision task scheduled, decision task timed out" events. If your workflow has a known execution time limit, the best way to catch this issue is to set a reasonable executionStartToCloseTimeout and look for timed-out workflow executions. If the decision task timeout is set too low, such workflows can also hit the limit on history size before the execution timeout.
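As a sketch of the counting step, the following Python function pages through ListClosedWorkflowExecutions via a boto3-style SWF client. list_closed_workflow_executions and its domain, startTimeFilter, closeStatusFilter and nextPageToken parameters come from boto3; the function name, its arguments and the 24-hour default window are assumptions.

```python
from datetime import datetime, timedelta, timezone


def count_closed_executions(swf_client, domain, status="FAILED", since=None):
    """Count closed executions in `domain` with the given close status.

    `status` is a close status such as "FAILED" or "TIMED_OUT".
    `since` defaults to 24 hours ago (an arbitrary choice for this sketch).
    """
    if since is None:
        since = datetime.now(timezone.utc) - timedelta(days=1)
    kwargs = {
        "domain": domain,
        "startTimeFilter": {"oldestDate": since},
        "closeStatusFilter": {"status": status},
    }
    count = 0
    while True:
        page = swf_client.list_closed_workflow_executions(**kwargs)
        count += len(page.get("executionInfos", []))
        token = page.get("nextPageToken")
        if not token:
            return count
        kwargs["nextPageToken"] = token  # fetch the next page of results
```

Running it once with status="FAILED" and once with status="TIMED_OUT" would cover the second and third failure types described above.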
All SWF metrics are now published to CloudWatch, so completed and failed workflows send metrics to CloudWatch, where you can create alarms that notify you when any workflow fails.

AWS SWF Simple Workflow - Best Way to Keep Activity Worker Scripts Running?

The maximum amount of time the pollForActivityTask method stays open polling for requests is 60 seconds. I am currently scheduling a cron job every minute to call my activity worker file so that my activity worker machine is constantly polling for jobs.
Is this the correct way to have continuous queue coverage?
The way the Java Flow SDK does it is that you create an ActivityWorker and give it a tasklist, domain, activity implementations, and a few other settings. You set both setPollThreadCount and setTaskExecutorSize. The polling threads long poll and then hand work over to the executor threads so that further polling is not blocked. You call start on the ActivityWorker to boot it up, and when you want to shut the workers down, you call one of the shutdown methods (usually shutdownAndAwaitTermination is best).
Essentially your workers are long lived and need to deal with a few factors:
New versions of Activities
Various tasklists
Scaling independently on tasklist, activity implementations, workflow workers, host sizes, etc.
Handle error cases and deal with polling
Handle shutdowns (in case of deployments and new versions)
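The same poll-then-hand-off shape can be sketched in Python as one long-lived process instead of a cron job. This is not the Java Flow SDK: it assumes a boto3-style SWF client (poll_for_activity_task, respond_activity_task_completed and respond_activity_task_failed are boto3 calls), while run_worker, _process, handler and the pool size are hypothetical names and values.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def _process(swf_client, task, handler):
    """Run the activity and report the outcome back to SWF."""
    try:
        result = handler(task.get("input", ""))
        swf_client.respond_activity_task_completed(
            taskToken=task["taskToken"], result=result
        )
    except Exception as exc:
        # `reason` is limited to 256 characters by the SWF API.
        swf_client.respond_activity_task_failed(
            taskToken=task["taskToken"], reason=str(exc)[:256]
        )


def run_worker(swf_client, domain, task_list, handler,
               stop_event=None, executor_size=4):
    """Long-poll for activity tasks until stop_event is set."""
    stop_event = stop_event or threading.Event()
    # Leaving the `with` block waits for in-flight tasks to finish.
    with ThreadPoolExecutor(max_workers=executor_size) as executor:
        while not stop_event.is_set():
            task = swf_client.poll_for_activity_task(
                domain=domain, taskList={"name": task_list}
            )
            if not task.get("taskToken"):
                continue  # the long poll timed out with no work; poll again
            # Hand off so the polling loop is not blocked by the activity.
            executor.submit(_process, swf_client, task, handler)
```

Setting stop_event from a signal handler would give a graceful shutdown for deployments; a process supervisor can restart the worker if it dies.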
I ended up using a solution where I had another script file that is called by a cron job every minute. This file checks whether an activity worker is already running in the background (if so, I assume a workflow execution is already being processed on the current server).
If no activity worker is there, then the previous long poll has completed and we launch the activity worker script again. If there is an activity worker already present, then the previous poll found a workflow execution and started processing so we refrain from launching another activity worker.

What's common practice for enabling a locking mechanism for multiple SQS consumers in Django so I can be idempotent

SQS expects your application to be idempotent and I've got multiple consumers/producers where (even if SQS had a deliver-once mechanism) I will have race conditions creating duplicates and race conditions consuming because my consumers run via cron jobs.
My current plan is to use the Django 1.4 select_for_update which should block other consumers on the same row, doing something like:
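# Must run inside a transaction so the row lock is held until commit.
reminders = EmailReminder.objects.select_for_update().filter(id=some_id)
reminder = reminders[0]
if not reminder.finished:
    reminder.send()
    reminder.finished = datetime.now()
    reminder.save()
    # Delete job.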
Are there better ways of dealing with this?
Hook up django-celery to SQS and have it designate a periodic job using celerybeat. Then have celeryd worker(s) running on the same queue anywhere you want. Only one will pick up a job at a time and execute it. No need to introduce DB locking on any level.
As long as your worker is guaranteed to finish its current task before celerybeat fires a new one you will never have a need for a lock. Now if you think there is a chance they may overlap you can introduce states for your notifications where:
Any reminder starts in "unsent" state.
Your celerybeat sends a request to process unsent emails to the queue.
Some worker picks it up and grabs all of them.
Immediately the worker transitions all of them to "sending" state.
Proceeds to send them one at a time (or in bulk).
If sending fails for any, revert their state back to unsent.
For all that succeeded transition to sent.
This way if celerybeat fires another job while your original job is not done with the initial batch, you won't have duplicate emails sent. As an added bonus you can scale the solution and distribute the load.