Once in a while, I have a job that requires much longer processing than anticipated, so I'd like to disable the timeout if possible.
You can pass job_timeout when enqueuing a job and it will be preserved. The default timeout is 3 minutes (180 seconds), so I believe your function is simply taking longer than that.
By default, jobs should execute within 180 seconds. After that, the worker kills the work horse and puts the job onto the failed queue, indicating the job timed out.
If a job requires more (or less) time to complete, the default timeout period can be loosened (or tightened) by specifying it as a keyword argument to the enqueue() call, like so:
```python
q = Queue()
q.enqueue(mytask, args=(foo,), kwargs={'bar': qux}, job_timeout=600)  # 10 mins
```
https://github.com/rq/rq/blob/6bfd47f735de3f297ba3c8f59d5e2dcfa1987107/docs/docs/results.md#L88
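If you would rather not pass job_timeout on every call, RQ's Queue constructor also accepts a queue-wide default. A minimal sketch, assuming a local Redis; mytask is a placeholder task function:

```python
# Set a queue-wide default timeout instead of per-job values.
from redis import Redis
from rq import Queue

q = Queue('default', connection=Redis(), default_timeout=3600)  # 1 hour
q.enqueue(mytask)                     # inherits the queue default
q.enqueue(mytask, job_timeout=86400)  # a per-job value still overrides it
```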
Related
I want to know whether a Lambda execution keeps running even if the Step Function state that invoked it times out. If it does, how can I stop it?
There is no way to kill a running Lambda. However, you can set the function's reserved concurrency to 0 to stop it from starting any further executions.
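For illustration, a minimal sketch of that using boto3's put_function_concurrency (the function name is a placeholder):

```python
# Set reserved concurrency to 0 so no new executions can start.
# This does not kill invocations that are already running.
import boto3

client = boto3.client('lambda')
client.put_function_concurrency(
    FunctionName='my-long-running-function',  # placeholder
    ReservedConcurrentExecutions=0,
)
# Later, remove the throttle again:
# client.delete_function_concurrency(FunctionName='my-long-running-function')
```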
Standard StepFunctions have a max timeout of 1 year. (yes! One year)
As such any individual task also has a max timeout of 1 year.
(Express StepFunctions have a maximum duration of 5 minutes, mind you.)
Lambdas have a max timeout of 15 minutes.
If you need your Lambda to complete in a certain amount of time, you are best served by setting your Lambda's timeout to that value, not your state machine's. (I see in your comments that you say you cannot pass a value for this? If you cannot change it, then you have no choice but to let it run its course.)
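If you can change the function configuration at all, a minimal boto3 sketch for setting the Lambda's own timeout (the function name is a placeholder; 900 seconds is the 15-minute Lambda maximum):

```python
# Set the function's own timeout; Lambda caps this at 900 seconds.
import boto3

boto3.client('lambda').update_function_configuration(
    FunctionName='my-task-function',  # placeholder
    Timeout=900,                      # seconds
)
```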
Consider StepFunctions and state machines to be orchestrators, but ones that have very little control over the individual components. They tell each component when to act, but otherwise they are stuck waiting on those components to reply before continuing.
If your Lambda times out, your state machine will fail that task, as it receives a Lambda service error. You can then handle that in the StepFunction without failing the entire process; see:
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
You could specifically use TimeoutSeconds (or TimeoutSecondsPath to read the value from the task input) in your definition, then catch the resulting States.Timeout error to produce a specific result if the task times out, as in the sketch below.
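A minimal sketch of such a task state in Amazon States Language; the resource ARN and the state names are placeholders:

```json
"ProcessRecord": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessRecord",
  "TimeoutSecondsPath": "$.taskTimeout",
  "Catch": [
    { "ErrorEquals": ["States.Timeout"], "Next": "HandleTimeout" }
  ],
  "Next": "Done"
}
```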
But as stated, no: once a Lambda begins execution, it will continue until it finishes or it times out at 15 minutes / its set timeout.
Since my project has so many moving parts, it's probably best to explain the symptom.
I have 1 scheduler running on 1 queue. I add scheduled jobs (to be executed within seconds of the scheduling).
I keep repeating the scheduling of jobs with NO rq worker doing anything (in fact, the worker process is completely off). In other words, the queue should just be piling up.
But all of a sudden, the queue gets chopped off (seemingly at random) and the first 70-80% of the jobs just disappear.
Does this have anything to do with:
the "max length" of the queue? (but I don't recall seeing any limits)
does the scheduler automatically "discard" jobs whose start time is BEFORE the current time?
I ran my own experiment: RQ scheduler does indeed remove jobs whose start date is earlier than now. A sketch of the experiment follows.
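A minimal version of that experiment with rq-scheduler, assuming a local Redis and a trivial module-level task function:

```python
# Schedule one job in the future and one in the past, then watch what
# the scheduler process does with them. Assumes a local Redis.
from datetime import datetime, timedelta
from redis import Redis
from rq import Queue
from rq_scheduler import Scheduler

def mytask():
    pass

conn = Redis()
scheduler = Scheduler(queue=Queue('default', connection=conn), connection=conn)

scheduler.enqueue_at(datetime.utcnow() + timedelta(seconds=30), mytask)
scheduler.enqueue_at(datetime.utcnow() - timedelta(minutes=10), mytask)

# Both jobs sit in the schedule until the scheduler process sweeps;
# at that point any job whose start time is already in the past is
# taken out of the schedule, which matches the observed behaviour.
print(list(scheduler.get_jobs()))
```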
I am using google-api-python-client, together with Google App Engine task queues for some async operations.
For the specific task queue, I am also setting the max number of times that the task should be retried (in my case retries are less likely to be successful, so I want to limit them).
Is there a way to write a handler which can handle the case where the task is still failing even after the specified number of retries?
Basically, if my retry limit is 5, then after 5 unsuccessful retries I want to move the task to a different queue where it can be retried more times with a larger interval between retries; that way it is more likely to succeed.
From here I believe I can use the X-AppEngine-TaskExecutionCount header on each retry and write some custom logic to know when the task is going to execute for the last time and achieve this, but I am trying to find out if there is a cleaner way.
By the way, X-AppEngine-TaskExecutionCount specifies (from the docs) the number of times the task has previously failed during the execution phase. This number does not include failures due to a lack of available instances.
At least presently there is no support for automatically moving a task from one queue to another.
One option is to keep the task on the same queue, increase the max number of retries, and use retry_parameters to customize the retry backoff policy (i.e. the increase in time between retries):
retry_parameters
Optional. Configures retry attempts for failed tasks. This addition allows you to specify the maximum number of times to retry failed tasks in a specific queue. You can also set a time limit for retry attempts and control the interval between attempts.
The retry parameters can contain the following subelements:
task_retry_limit
The maximum number of retry attempts for a failed task. If specified with task_age_limit, App Engine retries the task until both limits are reached. If 0 is specified, the task will not be retried.
task_age_limit (push queues)
The time limit for retrying a failed task, measured from when the task was first run. The value is a number followed by a unit of time, where the unit is s for seconds, m for minutes, h for hours, or d for days. For example, the value 5d specifies a limit of five days after the task's first execution attempt. If specified with task_retry_limit, App Engine retries the task until both limits are reached.
min_backoff_seconds (push queues)
The minimum number of seconds to wait before retrying a task after it fails.
max_backoff_seconds (push queues)
The maximum number of seconds to wait before retrying a task after it fails.
max_doublings (push queues)
The maximum number of times that the interval between failed task retries will be doubled before the increase becomes constant. The constant is: 2**max_doublings * min_backoff_seconds.
But the pattern of increase is gradual (doubling after each failure); you can't get a significant "step"-like jump in the time between retries. Still, it may be a good enough solution, and it requires no additional coding. Personally, I'd go with this approach. A minimal sketch follows.
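A sketch of this approach using the per-task retry options in the GAE Python taskqueue API, which mirror the queue.yaml subelements above; the URL, payload, and numbers are illustrative assumptions:

```python
# Keep the task on one queue but stretch the backoff policy.
from google.appengine.api import taskqueue

retry_options = taskqueue.TaskRetryOptions(
    task_retry_limit=20,       # allow many more attempts
    min_backoff_seconds=60,    # wait at least 1 minute before a retry
    max_backoff_seconds=3600,  # never wait more than 1 hour
    max_doublings=5,           # interval stops doubling after 2**5 * 60s
)
taskqueue.add(url='/tasks/work', payload='...', retry_options=retry_options)
```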
Another approach is to add the logic to determine whether a given execution is the final retry of the original task and, if so, enqueue a new corresponding task on a different queue with the desired "slower" retry policy, as sketched below. I'm unsure if this is what you were referring to in the question and wanted to avoid.
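A hedged sketch of that second approach: the handler inspects X-AppEngine-TaskExecutionCount and, on the last allowed attempt, re-enqueues the payload onto a slower queue. The framework (webapp2), queue name, URL, retry limit, and do_work are assumptions:

```python
import webapp2
from google.appengine.api import taskqueue

RETRY_LIMIT = 5  # must match task_retry_limit on the fast queue

class WorkHandler(webapp2.RequestHandler):
    def post(self):
        executions = int(
            self.request.headers.get('X-AppEngine-TaskExecutionCount', 0))
        try:
            do_work(self.request.body)  # hypothetical work function
        except Exception:
            if executions + 1 >= RETRY_LIMIT:
                # Final attempt on the fast queue: hand the payload to a
                # queue configured with the slower retry_parameters.
                taskqueue.add(queue_name='slow-retry',
                              url='/tasks/work',
                              payload=self.request.body)
                return  # respond 200 so the fast queue stops retrying
            raise  # let App Engine retry on the current queue
```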
We are experiencing double Lambda invocations of Lambdas triggered by S3 ObjectCreated events. The double invocations happen exactly 10 minutes after the first invocation: not 10 minutes after the first try completes, but 10 minutes after the first invocation happened. The original invocation takes anywhere from 0.1 to 5 seconds. No invocation results in an error; they all complete successfully.
We are aware that SQS, for example, guarantees at-least-once rather than exactly-once delivery of messages, and we would accept some of the Lambdas being invoked a second time as a result of the distributed system underneath. A delay of 10 minutes, however, seems very odd.
Of about 10k messages, 100-200 result in double invocations.
The AWS Support basically says "the 10 minute wait time is by design but we cannot tell you why", which is not at all helpful.
Has anyone else experienced this behaviour before?
How did you solve the issue or did you simply ignore it (which we could do)?
One proposed solution is not to use direct S3-lambda-triggers, but let S3 put its event on SNS and subscribe a Lambda to that. Any experience with that approach?
Example log: two invocations, 10 minutes apart, same RequestId:

```
START RequestId: f9b76436-1489-11e7-8586-33e40817cb02 Version: 13
2017-03-29 14:14:09 INFO ImageProcessingLambda:104 - handle 1 records
```

and

```
START RequestId: f9b76436-1489-11e7-8586-33e40817cb02 Version: 13
2017-03-29 14:24:09 INFO ImageProcessingLambda:104 - handle 1 records
```
After a couple of rounds with AWS support and others, and a few isolated trial runs, it seems this is simply "by design". It is not clear why, but it simply happens. The problem is neither S3 nor SQS/SNS, but the Lambda invocation itself and how the Lambda service dispatches invocations to Lambda instances.
The double invocations happen on somewhere between 1% and 3% of all invocations, 10 minutes after the first invocation. Surprisingly, there are even triple (and probably quadruple) invocations, at rates that are roughly powers of the base probability, so about 0.09% for triples. The triple invocations happened 20 minutes after the first one.
If you encounter this, you simply have to work around it using whatever you have access to. We, for example, now store the already-processed entities in Cassandra with a TTL of 1 hour, and only act on a message from the Lambda if the entity has not been processed yet (see the sketch below). The double and triple invocations all happen within this one-hour timeframe.
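For illustration, a sketch of that dedup guard with the Python cassandra-driver; the keyspace, table, the way the entity id is derived, and do_work are assumptions:

```python
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('dedup')  # placeholder keyspace

def handler(event, context):
    entity_id = event['Records'][0]['s3']['object']['key']
    # Lightweight transaction: the insert only applies if the row is new;
    # the TTL makes the marker expire after the one-hour window.
    result = session.execute(
        "INSERT INTO processed (id) VALUES (%s) IF NOT EXISTS USING TTL 3600",
        (entity_id,),
    )
    if not result.was_applied:
        return  # duplicate invocation inside the window: skip
    do_work(event)  # hypothetical
```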
Not wanting to spin up a data store like Dynamo just to handle this, I did two things to solve our use case:
Write a lock file per function into S3 (which we were already using for this one) and check for its existence on function entry, aborting if present; for this function we only ever want one instance of it running at a time. The lock file is removed before we call the callback on error or success.
Write a request time into the initial event payload and check it on function entry; if the request time is too old, abort. We don't want Lambda retries on error unless they happen quickly, so this handles the case where a duplicate or retry arrives while another invocation of the same function is not already running (which would otherwise be stopped by the lock file), and it also avoids the small overhead of the S3 lock-file requests in that case. A combined sketch of both guards follows.
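A hedged sketch of both guards together; the bucket, key, age threshold, event field, and do_work are placeholders, and (as in the original) the check-then-put lock is best-effort rather than atomic:

```python
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
BUCKET = 'my-work-bucket'            # placeholder
LOCK_KEY = 'locks/my-function.lock'  # placeholder
MAX_AGE_SECONDS = 60                 # placeholder threshold

def handler(event, context):
    # Guard 2: abort if the producer-written request time is too old.
    if time.time() - event['requestTime'] > MAX_AGE_SECONDS:
        return
    # Guard 1: per-function lock file in S3.
    try:
        s3.head_object(Bucket=BUCKET, Key=LOCK_KEY)
        return  # lock present: another invocation is already running
    except ClientError as e:
        if e.response['Error']['Code'] != '404':
            raise  # unexpected error; a missing key surfaces as 404
    s3.put_object(Bucket=BUCKET, Key=LOCK_KEY, Body=b'')
    try:
        do_work(event)  # hypothetical
    finally:
        s3.delete_object(Bucket=BUCKET, Key=LOCK_KEY)
```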
I've got a service system that gets requests from another system. A request contains information that is stored in the service system's MySQL database. Once a request is received, the server should start a timer that sends a FAIL message to the sender once the allotted time has elapsed.
The problem is that this is a dynamic system that can receive multiple requests from the same or various sources. If a request with a timeout limit of 5 minutes arrives from a source, and another request comes from the same source only 2 minutes later, the system should be able to handle both. Thus, a timer needs to be started for every incoming message. The service is a web service written in C++, with the information stored in a MySQL database.
Any ideas how I could do this?
A way I've often seen this done: use a SINGLE timer and keep a priority queue (sorted by target time) of every pending timeout. That way you always know how long to wait until the next timeout, and you avoid the overhead of managing hundreds of timers simultaneously.
Say at time 0 you get a request with a timeout of 100.
Queue: [100]
You set your timer to fire in 100 seconds.
Then at time 10 you get a new request with a timeout of 50.
Queue: [60, 100]
You cancel your timer and set it to fire in 50 seconds.
When it fires, it handles that timeout, removes 60 from the queue, sees that the next deadline is 100, and sets the timer to fire in 40 seconds.
Now say you get another request with a timeout of 100, at time 80.
Queue: [100, 180]
In this case, since the head of the queue (100) doesn't change, you don't need to reset the timer. Hopefully this explanation makes the algorithm clear; a minimal sketch follows.
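The question is about C++, but here is a compact sketch of the same algorithm in Python, with heapq and threading.Timer standing in for the platform timer; production code would need more care around races between the timer callback and newly arriving requests:

```python
import heapq
import threading
import time

class TimeoutManager:
    def __init__(self):
        self._heap = []    # entries are (deadline, request_id)
        self._timer = None
        self._lock = threading.Lock()

    def add(self, request_id, timeout_seconds):
        deadline = time.monotonic() + timeout_seconds
        with self._lock:
            heapq.heappush(self._heap, (deadline, request_id))
            if self._heap[0][1] == request_id:
                # The new entry is the earliest deadline: reset the timer.
                self._reset_timer()

    def _reset_timer(self):
        if self._timer is not None:
            self._timer.cancel()
        delay = max(0.0, self._heap[0][0] - time.monotonic())
        self._timer = threading.Timer(delay, self._fire)
        self._timer.start()

    def _fire(self):
        with self._lock:
            _, request_id = heapq.heappop(self._heap)
            print('FAIL sent for request', request_id)  # stand-in for the real FAIL
            if self._heap:
                self._reset_timer()
```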
Of course, each entry in the queue will need some link to the request associated with the timeout, but I imagine that should be simple.
Note however that this all may be unnecessary, depending on the mechanism you use for your timers. For example, if you're on Windows, you can use CreateTimerQueue, which I imagine uses this same (or very similar) logic internally.