Failed cloud tasks are not being retried with task queue retry config - google-cloud-platform

I'm using Google Cloud Tasks with HTTP triggers to invoke Cloud Functions. I've set up the Cloud Tasks queue retry parameters as follows:
Max attempts: 2
Max retry duration: 16s
Min backoff: 1s
Max backoff: 16s
Max doublings: 4
I often have bursts of tasks that create around 600 tasks within a second or two. At times, about 15% of these will fail (this is expected and intentional). I expect these failed tasks to retry according to the queue configuration, so no retry should be scheduled more than 16 seconds beyond the task's originally scheduled time. However, I'm seeing some failed tasks scheduled for retry several minutes out. Typically, the first few failed tasks are scheduled for retry only a few seconds out, but some of the last few failed tasks in a burst have retries scheduled many minutes away.
Why are these retry schedules not honoring my retry config?
If it helps, I also have these settings on the queue:
Max dispatches: 40
Max concurrent dispatches: 40
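
For reference, here is a rough sketch of the same settings expressed with the google-cloud-tasks Python client, in case that helps pin down what the queue actually has (the project, location, and queue name below are placeholders, not from my setup):

from google.cloud import tasks_v2
from google.protobuf import duration_pb2, field_mask_pb2

client = tasks_v2.CloudTasksClient()

# The retry and rate-limit values described above.
queue = tasks_v2.Queue(
    name=client.queue_path("my-project", "us-central1", "my-queue"),
    retry_config=tasks_v2.RetryConfig(
        max_attempts=2,
        max_retry_duration=duration_pb2.Duration(seconds=16),
        min_backoff=duration_pb2.Duration(seconds=1),
        max_backoff=duration_pb2.Duration(seconds=16),
        max_doublings=4,
    ),
    rate_limits=tasks_v2.RateLimits(
        max_dispatches_per_second=40,
        max_concurrent_dispatches=40,
    ),
)

client.update_queue(
    queue=queue,
    update_mask=field_mask_pb2.FieldMask(paths=["retry_config", "rate_limits"]),
)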

Related

AWS lambda throttling retries

I have a question about Lambda's asynchronous invocation: https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
If the function doesn't have enough concurrency available to process
all events, additional requests are throttled. For throttling errors
(429) and system errors (500-series), Lambda returns the event to the
queue and attempts to run the function again for up to 6 hours. The
retry interval increases exponentially from 1 second after the first
attempt to a maximum of 5 minutes. If the queue contains many entries,
Lambda increases the retry interval and reduces the rate at which it
reads events from the queue.
From the doc, it seems that if I set reserved concurrency for a Lambda function and it can't process events due to throttling, the event could be retried for up to 6 hours. However, it doesn't say anything about the total number of retries. How is this different from the scenario where the Lambda function returns an error (in which case it gets retried a maximum of 2 times)?
If the queue contains many entries, Lambda increases the retry interval and reduces the rate at which it reads events from the queue.
It seems that Lambda retries the event only when there is enough concurrency available for it. If not, it will wait for up to 6 hours.
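
If the 6-hour window is a concern, my understanding is that the function's async invoke config can cap how long throttled events are kept and retried. A rough boto3 sketch, assuming the standard Lambda API (the function name and values are placeholders):

import boto3

lambda_client = boto3.client("lambda")

# Reserve concurrency; invocations beyond this limit are throttled (429).
lambda_client.put_function_concurrency(
    FunctionName="my-function",
    ReservedConcurrentExecutions=10,
)

# Cap how long asynchronous events are kept and retried.
lambda_client.put_function_event_invoke_config(
    FunctionName="my-function",
    MaximumEventAgeInSeconds=3600,  # discard events older than 1 hour instead of 6
    MaximumRetryAttempts=2,         # applies to function errors, not to throttles
)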

Set different retry delay times for tasks within the same dag

I have an Airflow DAG with many sub-tasks. I know that when certain tasks fail they can be re-run in 5 minutes, while other tasks can be re-run in 60 minutes. How can I set my tasks to be rerun on failure like this?
I found this question and answer on stack overflow however this only changes the number of retries.
Operators should support a retry_delay as well - see BaseOperator:
retry_delay (datetime.timedelta) – delay between retries
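
A minimal sketch, assuming a standard Airflow 2 setup, where each task gets its own retry_delay (the DAG id, task ids, callables, and retry counts are placeholders):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="mixed_retry_delays",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # This task is retried 5 minutes after a failure.
    fast_retry = PythonOperator(
        task_id="fast_retry_task",
        python_callable=lambda: None,
        retries=3,
        retry_delay=timedelta(minutes=5),
    )

    # This task is retried 60 minutes after a failure.
    slow_retry = PythonOperator(
        task_id="slow_retry_task",
        python_callable=lambda: None,
        retries=3,
        retry_delay=timedelta(minutes=60),
    )

    fast_retry >> slow_retry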

Cloud Run crashes after 121 seconds

I'm triggering a long-running scraping function on Cloud Run with a Pub/Sub topic and subscription trigger. Every time I run it, it crashes after 121.8 seconds, and I don't get why.
POST 503 556B 121.8s APIs-Google; (+https://developers.google.com/webmasters/APIs-Google.html) https://????.a.run.app/
The request failed because either the HTTP response was malformed or connection to the instance had an error.
I've got a built-in timeout trigger, and when I set it to 1 minute the function runs without any problems, but when I set it to 2 minutes the above error is triggered, so it must be something with the Cloud Run or subscription timeout settings. I've tried increasing those (read more below).
Things involved
1 x Cloud Run
1 x Pub/Sub subscription
1 x Pub/Sub topic
These are the things I've checked
The timeout of the Cloud Run instance (900 sec)
The timeout of the Pub/Sub subscription (acknowledgement deadline: 600 sec; message retention duration: 10 minutes)
I've increased the memory to 4 GB, which is way above what's needed.
Anyone who can point me in the right direction?
This is almost certainly due to Node.js's default server timeout of 120 seconds.
Try server.setTimeout(0) to remove this timeout.

Google cloud task queues not running in parallel

I have a project in Google Cloud with 2 task queues: process-request to receive and process requests, and send-result to send the result of a processed request to another server. They both run on a service called remote-processing.
My problem is that I see the tasks being enqueued in send-result but they are only executed after the process-request queue is empty and has processed all requests.
This is the instance config:
instance_class: B4
basic_scaling:
  max_instances: 8
Here is the queue config:
- name: send-result
  max_concurrent_requests: 20
  rate: 1/s
  retry_parameters:
    task_retry_limit: 10
    min_backoff_seconds: 5
    max_backoff_seconds: 20
  target: remote-processing

- name: process-request
  bucket_size: 50
  max_concurrent_requests: 10
  rate: 10/s
  target: remote-processing
Clarification: I don't need the queues to run in a specific order, but I find it very strange that the instance appears to only run one queue at a time, only running the tasks in another queue after it is done with the current queue.
Over what period of time is this all happening?
How long does a process-request task take to run vs. a send-result task?
One thing that sticks out is that your rate for process-request is much higher than your rate for send-result. So maybe a couple of send-result tasks ARE squeezing through, but it then hits the rate cap and has to run process-request tasks instead.
Same note for bucket_size. The bucket_size for process-request is huge compared to its rate:
The bucket size limits how fast the queue is processed when many tasks
are in the queue and the rate is high. The maximum value for bucket
size is 500. This allows you to have a high rate so processing starts
shortly after a task is enqueued, but still limit resource usage when
many tasks are enqueued in a short period of time.
If you don't specify bucket_size for a queue, the default value is 5.
We recommend that you set this to a larger value because the default
size might be too small for many use cases: the recommended size is
the processing rate divided by 5 (rate/5).
https://cloud.google.com/appengine/docs/standard/python/config/queueref
Also, with max_instances: 8, does a big backlog of work build up in these queues?
Let's try two things:
Set bucket_size and rate to be the same for both process-request and send-result. If that fixes it, then start fiddling with the values to get the desired balance.
Bump up max_instances from 8 to see if removing that bottleneck fixes it.

AWS Lambda depleting the SQS queue very slowly

I have an SQS queue and a Lambda function which consumes the queue with batch size 10.
Lambda
Reserved concurrency = 600
Timeout = 15 minutes
Memory = 640 MB (but using 150-200 MB per execution)
Processing one item from the queue takes about 10 seconds.
SQS
Messages Available (Visible): 5,310
Messages in Flight (Not Visible): 3,355
Default Visibility Timeout: 20 minutes
With these settings, I'm expecting my Lambda function to run 600 invocations concurrently because, as you can see, the queue is full and there are items to be received from it. So the function shouldn't be idle and should use all of the available concurrency.
I'm aware of the burst in the first minute and that my concurrency then increases every minute until it hits the limit. But my invocation count is always between 40 and 80; it never hits 600, and my queue is depleted very slowly. And (according to the logs) almost none of the queue items are failing, so they are not going back to the queue.
What is wrong with my settings?
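
For reference, here is roughly how the setup described above would look if configured with boto3 (the function name, queue URL/ARN, account, and region below are placeholders, not my real names):

import boto3

lambda_client = boto3.client("lambda")
sqs = boto3.client("sqs")

# Lambda: 600 reserved concurrency, 15-minute timeout, 640 MB memory.
lambda_client.put_function_concurrency(
    FunctionName="queue-consumer",
    ReservedConcurrentExecutions=600,
)
lambda_client.update_function_configuration(
    FunctionName="queue-consumer",
    Timeout=900,
    MemorySize=640,
)

# SQS: 20-minute default visibility timeout.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/work-queue",
    Attributes={"VisibilityTimeout": "1200"},
)

# Event source mapping that reads from the queue with batch size 10.
lambda_client.create_event_source_mapping(
    FunctionName="queue-consumer",
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:work-queue",
    BatchSize=10,
)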
EDIT:
Also, another chart: it increased for a moment and then decreased again.