handling failure after maximum number of retries in google app engine task queues - python-2.7

I am using google-api-python-client together with Google App Engine task queues for some async operations.
For the specific task queue, I am also setting the maximum number of times that the task should be retried (in my case retries are unlikely to succeed, so I want to limit them).
Is there a way to write a handler for the case where the task is still failing even after the specified number of retries?
Basically, if my retry limit is 5, then after 5 unsuccessful retries I want to move the task to a different queue where it can be retried more times, with a larger interval between retries, so that it is more likely to succeed.
From here I believe that I can read the X-AppEngine-TaskExecutionCount header on each retry and write some custom logic to know when the task is executing for the last time, but I am trying to find out if there is a cleaner way.
By the way, X-AppEngine-TaskExecutionCount specifies (from the docs): "The number of times this task has previously failed during the execution phase. This number does not include failures due to a lack of available instances."

At least presently there is no support for automatically moving a task from one queue to another.
One option is to keep the task on the same queue, increase the max number of retries and use the retry_parameters to customize the retry backoff policy (i.e. the increase of time between retries):
retry_parameters
Optional. Configures retry attempts for failed tasks. This addition
allows you to specify the maximum number of times to retry failed
tasks in a specific queue. You can also set a time limit for retry
attempts and control the interval between attempts.
The retry parameters can contain the following subelements:
task_retry_limit
The maximum number of retry attempts for a failed task. If specified with task_age_limit, App Engine retries the task until
both limits are reached. If 0 is specified, the task will not be
retried.
task_age_limit (push queues)
The time limit for retrying a failed task, measured from when the task was first run. The value is a number followed by a unit of time,
where the unit is s for seconds, m for minutes, h for
hours, or d for days. For example, the value 5d specifies a
limit of five days after the task's first execution attempt. If
specified with task_retry_limit, App Engine retries the task until
both limits are reached.
min_backoff_seconds (push queues)
The minimum number of seconds to wait before retrying a task after it fails.
max_backoff_seconds (push queues)
The maximum number of seconds to wait before retrying a task after it fails.
max_doublings (push queues)
The maximum number of times that the interval between failed task retries will be doubled before the increase becomes constant. The constant is: 2**max_doublings * min_backoff_seconds.
Note that the increase is gradual - the interval doubles after each failure - so you can't get a significant "step"-like jump in the time between retries. Still, it may be a good enough solution that requires no additional coding. Personally I'd go for this approach.
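For example, a queue.yaml entry with a slow backoff policy might look roughly like this (the queue name and the specific values are illustrative, not taken from the question):

```yaml
queue:
- name: slow-retry-queue     # illustrative name
  rate: 5/s
  retry_parameters:
    task_retry_limit: 20
    task_age_limit: 2d
    min_backoff_seconds: 10
    max_backoff_seconds: 3600
    max_doublings: 5         # after 5 doublings the interval grows by a constant step
```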
Another approach is to add logic that determines whether the current execution is the final retry of the original task and, if so, enqueues a new corresponding task on a different queue with the desired "slower" retry policy. I'm unsure if this is what you were referring to in the question and wanted to avoid.
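As a sketch of that second approach, the "last retry" check could look like this. The function name is my own, and the off-by-one assumes the header value equals task_retry_limit on the final permitted attempt (initial run plus retry_limit - 1 prior retries have all failed) - verify this against your queue's actual behavior:

```python
# Hypothetical helper for the "final retry" check. execution_count is the
# integer value of the X-AppEngine-TaskExecutionCount request header;
# retry_limit is the queue's task_retry_limit.
def should_move_to_slow_queue(execution_count, retry_limit):
    # The header counts previous execution failures, so by the final
    # permitted retry the task has already failed retry_limit times
    # (the initial run plus retry_limit - 1 retries).
    return execution_count >= retry_limit

# Inside the task handler you would then do something like (GAE sketch):
#   count = int(self.request.headers.get('X-AppEngine-TaskExecutionCount', 0))
#   if should_move_to_slow_queue(count, 5):
#       taskqueue.add(url=self.request.path, params=...,
#                     queue_name='slow-retry-queue')
```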

Related

TryTimeout for EventHubProducerClient SendAsync

We can set a timeout value for EventHubProducerClient.SendAsync() when creating the client, using EventHubProducerClientOptions.
I have initialized EventHubProducerClientOptions.TryTimeout with 10 seconds and EventHubProducerClientOptions.MaximumRetries with 3.
Is the TryTimeout value applied to each individual retry (10 seconds * 3 retries), or is the TryTimeout value shared between all retries?
Example:
If each individual retry takes 5 seconds before failing, then under the first interpretation the call would time out after 15 seconds, versus the second interpretation, which would time out after 10 seconds with 2 retries.
The TryTimeout governs a single service operation. In the case of transient failures with retries, the EventHubsRetryOptions for MaximumDelay, Mode, and (sometimes) Delay control the delay between those attempts.
In the worst case, you'll see the call take MaximumRetries * TryTimeout plus the retry policy delays. With the default exponential retry pattern, delays contain a small amount of random jitter, so you cannot reasonably predict the exact amount of delay.
To illustrate, let's take your scenario and assume that each service operation times out. The pattern that you would see is:
1. Your application makes a call to SendAsync.
2. The service operation times out after 10 seconds (TryTimeout).
3. The failure is deemed transient; the retry policy is consulted and specifies a delay. This is governed by the retry options and will be <= the MaximumDelay value.
4. After the retry delay, the service operation is made a second time and times out after 10 seconds.
5. The failure is deemed transient; the retry policy specifies a delay.
6. After the retry delay, the service operation is made a third time and times out after 10 seconds.
7. The failure is deemed transient; the retry policy recognizes that it is out of retries (MaximumRetries was set to 3) and does not specify a delay.
8. The call to SendAsync returns control to your application, throwing a TimeoutException.
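To make the arithmetic concrete, a rough worst-case bound for the scenario above (each attempt hitting the full 10-second TryTimeout) can be sketched as follows; the delay values are made up for illustration, since the real delays include jitter:

```python
try_timeout = 10         # seconds per service operation (TryTimeout)
attempts = 3             # attempts made in the scenario above
backoff_delays = [1, 2]  # hypothetical delays between attempts (jittered in reality)

# Each attempt runs for the full TryTimeout, plus the delays in between.
worst_case_seconds = attempts * try_timeout + sum(backoff_delays)
print(worst_case_seconds)  # 33
```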

How does a parallel multi-instance loop work in Camunda 7.16.6

I'm using the camunda-engine 7.16.6.
I have a process with a multi-instance loop like this one that repeats, in parallel, 1000 times.
This loop executes in parallel. My assumption was that n Camunda executors would start their work, so executor #1 executes Task 2, then Task 3, then Task 4, while executor #2 and all the others do the same. After a short while, at least some of the 1000 instances would have finished all three tasks in the loop.
However, what I have observed so far is that Task 2 gets executed 1000 times, and only when that is finished does Task 3 get executed 1000 times, and so on.
I also noticed that Camunda itself takes a lot of time, outside of the tasks.
Is my observation correct, and is this behavior documented somewhere? Can you change this behavior?
I've run some tests and can explain the behavior:
The order of tasks and the overall time to finish is influenced by whether or not there are transaction boundaries (async after, the red bars in the screenshot).
This is described briefly here.
By setting the asyncBefore='true' attribute we introduce an additional save point at which the process state will be persisted and committed to the database. A separate job executor thread will continue the process asynchronously by using a separate database transaction. In case this transaction fails the service task will be retried and eventually marked as failed - in order to be dealt with by a human operator.
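In the BPMN XML such a save point is just an attribute on the task; a minimal illustrative excerpt (the id, name, and delegate class are made up):

```xml
<serviceTask id="task2" name="Task 2"
             camunda:asyncBefore="true"
             camunda:exclusive="true"
             camunda:class="org.example.Task2Delegate" />
```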
repeat 1000 times, parallel, no transaction
One job executor rushes through the process; the order is 1, [2,3,4|2,3,4|...], 5. Not really parallel. But this is as documented here:
The Job Executor makes sure that jobs from a single process instance are never executed concurrently.
It can be turned off if you are an expert and know what you are doing (and have understood this section).
Overall this took around 5 seconds.
repeat 1000 times, parallel, with transaction
Here, due to the transactions, there will be 1000 waiting jobs for Task 7, and each finished Task 7 creates another job for Task 8. Since jobs are executed in the order they appear in the database (see here), the order is 6, [7,7,7...8,8,8...9,9,9...], 10.
The transaction handling, which includes maintaining the variables, has a huge impact on the runtime: in parallel mode with transactions it runs for 6:33 minutes.
If you turn off the exclusive flag it takes around 4:30 minutes, but at the cost of thousands of OptimisticLockingExceptions.
Afaik the recommended approach to gain true parallelism would be to move Task 7, Task 8 and Task 9 to a separate process and spawn 1000 instances of that process.
You can influence the order of execution if you tweak the job executor settings & priority (see here), but that seems to require the exclusive flag, too. If you do that, the order will be 6, [7,7,7|8,9,8,9 (in random order),...], 10.
repeat 1000 times, sequential, no transaction
The order is 11, [12,13,14|12,13,14|...], 15.
This takes only 2 seconds.
repeat 1000 times, sequential, with transaction
The order is, as expected, 16, [17,18,19|17,18,19|...], 20.
Due to the transactions, this takes 2:45 minutes.
I heard from colleagues that one should use parallel mode only if the loop involves long-running/blocking tasks such as a human task: in sequential mode there would only be one human task at a time, and only after it is done would another be created; in parallel mode you have all 1000 human tasks at once, which is more likely the desired behavior.
Parallel performance seems to be improved in Camunda 8

When a state of a step function times out, does the lambda execution correlated to it continue to be performed?

I want to know if a lambda execution continues to run even if the state of the step function correlated to it times out. If it does, how can I stop it?
There is no way to kill a running lambda. However, you can set its reserved concurrency to 0 to stop it from starting further executions.
Standard StepFunctions have a max timeout of 1 year. (yes! One year)
As such any individual task also has a max timeout of 1 year.
(Express StepFunctions have a max timeout of 5 minutes, mind you)
Lambdas have a max timeout of 15 mins.
If you need your lambda to complete in a certain amount of time, you are best served by setting your lambda timeout to that - not your state machine's. (I see in your comments you say you cannot pass a value for this? If you cannot change it, then you have no choice but to let it run its course.)
Consider StepFunctions and state machines to be orchestrators, but they have very little control over the individual components. They tell who to act and when but otherwise are stuck waiting on those components to reply before continuing.
If your lambda times out, your state machine will fail that task, as it receives a Lambda service error. You can then handle that in the Step Function without failing the entire process; see:
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
You could specifically use TimeoutSecondsPath in your definition to set a specific result if the task times out.
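A task state combining a timeout with a catch for the resulting error might look roughly like this (the state names and timeout value are illustrative, not from the question):

```json
{
  "CallLambda": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "TimeoutSeconds": 60,
    "Catch": [ {
      "ErrorEquals": ["States.Timeout"],
      "Next": "HandleTimeout"
    } ],
    "Next": "NextState"
  }
}
```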
But as stated, no: once a lambda begins execution it will continue until it finishes or it hits its configured timeout (at most 15 minutes).

How to limit concurrency of a step in step functions

I have a state machine in AWS Step Functions. I want to limit the concurrency of a task (implemented via Lambda) to reduce traffic to one of my downstream APIs.
I can restrict the Lambda's concurrency, but then the task fails with a "Lambda.TooManyExecutions" failure. Can someone please share a simple approach to limiting the concurrency of a Lambda task?
Thanks,
Vinod.
Within the same state machine execution
You can use a Map state to run these tasks in parallel, and use the maximum concurrency setting to reduce excessive lambda executions.
The Map state ("Type": "Map") can be used to run a set of steps for each element of an input array. While the Parallel state executes multiple branches of steps using the same input, a Map state will execute the same steps for multiple entries of an array in the state input.
MaxConcurrency (Optional)
The MaxConcurrency field’s value is an integer that provides an upper bound on how many invocations of the Iterator may run in parallel. For instance, a MaxConcurrency value of 10 will limit your Map state to 10 concurrent iterations running at one time.
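A Map state with bounded concurrency might look roughly like this (everything except the ASL keywords - the state names, path, and ARN - is illustrative):

```json
{
  "ProcessItems": {
    "Type": "Map",
    "ItemsPath": "$.items",
    "MaxConcurrency": 10,
    "Iterator": {
      "StartAt": "CallDownstream",
      "States": {
        "CallDownstream": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:MyFunction",
          "End": true
        }
      }
    },
    "End": true
  }
}
```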
This should reduce the likelihood of issues. That said, you would still benefit from adding a retry statement for these cases. Here's an example:
{
"Retry": [ {
"ErrorEquals": ["Lambda.TooManyRequestsException", "Lambda.ServiceException"],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
} ]
}
Across different executions
If you want to control this concurrency across different executions, you'll have to implement some kind of separate control yourself. One way to prepare your state machine for that is to request the data you need and then use an Activity to wait for a response.
You can use the Lambda concurrency limit you mentioned, but then add a Retry clause to your Step Function so that when you hit the concurrency limit, Step Functions manages retrying the task that failed.
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html#error-handling-examples
There’s a limit to the number of retries, but you get to define it.
Alternatively, if you want to retry without limit, you can use a Catch to move to a Wait state when that concurrency error is thrown. You can read about Catch in the link above too. Here's the Wait state doc:
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-wait-state.html
You then have the Wait state transition back to the task state after its wait completes.
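That catch-and-wait loop could be sketched like this (the state names, ARN, and wait duration are illustrative):

```json
{
  "InvokeLambda": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:MyFunction",
    "Catch": [ {
      "ErrorEquals": ["Lambda.TooManyRequestsException"],
      "Next": "WaitAndRetry"
    } ],
    "End": true
  },
  "WaitAndRetry": {
    "Type": "Wait",
    "Seconds": 30,
    "Next": "InvokeLambda"
  }
}
```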

Multiple Timers in C++ / MySQL

I've got a service system that receives requests from another system. A request contains information that is stored in the service system's MySQL database. Once a request is received, the server should start a timer that sends a FAIL message to the sender if the allotted time elapses.
The problem is that it is a dynamic system that can receive multiple requests from the same or various sources. If a request is received from a source with a timeout limit of 5 minutes, and another request comes from the same source only 2 minutes later, it should be able to handle both. Thus, a timer needs to be maintained for every incoming message. The service is a web service programmed in C++, with the information stored in a MySQL database.
Any ideas how I could do this?
A way I've seen this often done: use a SINGLE timer, and keep a priority queue (sorted by target time) of every timeout. That way, you always know how long to wait until the next timeout, and you avoid the overhead of managing hundreds of timers simultaneously.
Say at time 0 you get a request with a timeout of 100.
Queue: [100]
You set your timer to fire in 100 seconds.
Then at time 10 you get a new request with a timeout of 50.
Queue: [60, 100]
You cancel your timer and set it to fire in 50 seconds.
When it fires, it handles the timeout, removes 60 from the queue, sees that the next time is 100, and sets the timer to fire in 40 seconds. Say you get another request with a timeout of 100, at time 80.
Queue: [100, 180]
In this case, since the head of the queue (100) doesn't change, you don't need to reset the timer. Hopefully this explanation makes the algorithm pretty clear.
Of course, each entry in the queue will need some link to the request associated with the timeout, but I imagine that should be simple.
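A minimal sketch of this single-timer scheme, in Python for brevity (the same structure maps directly onto C++ with std::priority_queue or std::multimap; the class and method names are my own):

```python
import heapq

class TimerQueue:
    """Priority queue of deadlines backing a single OS timer."""

    def __init__(self):
        self._heap = []  # entries are (deadline, request_id)

    def add(self, now, timeout, request_id):
        """Register a timeout. Returns True if the single timer must be
        reset, i.e. the new deadline is earlier than the current head."""
        deadline = now + timeout
        reset = not self._heap or deadline < self._heap[0][0]
        heapq.heappush(self._heap, (deadline, request_id))
        return reset

    def next_deadline(self):
        """Deadline the single timer should currently be armed for."""
        return self._heap[0][0] if self._heap else None

    def pop_expired(self, now):
        """Remove and return request_ids whose deadline has passed;
        the caller sends the FAIL message for each of them."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            expired.append(heapq.heappop(self._heap)[1])
        return expired
```

The caller keeps one timer armed for next_deadline(); whenever add() returns True, the timer is cancelled and re-armed for the new, earlier head - exactly the cancel-and-reset step in the walkthrough above.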
Note however that this all may be unnecessary, depending on the mechanism you use for your timers. For example, if you're on Windows, you can use CreateTimerQueue, which I imagine uses this same (or very similar) logic internally.