Google Cloud Tasks HTTP trigger - how to disable retry - google-cloud-platform

I'm trying to create a Cloud Tasks queue that never retries if an HTTP task fails.
According to the documentation, maxAttempts should be what I'm looking for:
Number of attempts per task.
Cloud Tasks will attempt the task maxAttempts times (that is, if the
first attempt fails, then there will be maxAttempts - 1 retries). Must
be >= -1.
So, if maxAttempts is 1, there should be 0 retries.
But, for example, if I run
gcloud tasks queues create test-queue --max-attempts=1 --log-sampling-ratio=1.0
then use the following Python code to create an HTTP task:
from google.cloud import tasks_v2beta3
from google.protobuf import timestamp_pb2

client = tasks_v2beta3.CloudTasksClient()

project = 'project_id'  # replace by real project ID
queue = 'test-queue'
location = 'us-central1'
url = 'https://example.com/task_handler'  # replace by some endpoint that returns a 5xx status code

parent = client.queue_path(project, location, queue)

task = {
    'http_request': {  # Specify the type of request.
        'http_method': 'POST',
        'url': url  # The full url path that the task will be sent to.
    }
}

response = client.create_task(parent, task)
print('Created task {}'.format(response.name))
In the Stackdriver logs for the queue (which I can see because I used --log-sampling-ratio=1.0 when creating the queue), the task is apparently retried once: there is one dispatch attempt, followed by a dispatch response with status UNAVAILABLE, followed by another dispatch attempt, which is finally followed by the last dispatch response (also indicating UNAVAILABLE).
Is there any way to retry 0 times?
Note
About maxAttempts, the documentation also says:
This field has the same meaning as task_retry_limit in queue.yaml/xml.
However, when I go to the description for task_retry_limit, it says:
The number of retries. For example, if 0 is specified and the task
fails, the task is not retried at all. If 1 is specified and the task
fails, the task is retried once. If this parameter is unspecified, the
task is retried indefinitely. If task_retry_limit is specified with
task_age_limit, the task is retried until both limits are reached.
This seems to be inconsistent with the description of maxAttempts, as it indicates that the task would be retried once if the parameter is 1.
I've experimented with setting maxAttempts to 0, but that seems to make it assume a default value of 100.
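For reference, here is a minimal sketch (assuming the same project, location and queue as above) of inspecting the retry configuration the queue actually ended up with, using the same client library:

from google.cloud import tasks_v2beta3

client = tasks_v2beta3.CloudTasksClient()
queue_path = client.queue_path('project_id', 'us-central1', 'test-queue')  # replace by real project ID

queue = client.get_queue(queue_path)
# Shows the max_attempts value Cloud Tasks actually applied to the queue.
print(queue.retry_config.max_attempts)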
Thank you in advance.

As @averi-kitsch mentioned, this is currently an internal issue which our Cloud Tasks engineering team is working on right now; sadly, we don't have any ETA yet.
You can follow the progress of this issue in this Public Issue Tracker; click on the "star" to subscribe to it and receive future updates.
As a workaround, if you don't want the task to be retried after it fails, set task_retry_limit: 0 directly in queue.yaml.
Example :
queue:
- name: my-queue1
  rate: 1/s
  retry_parameters:
    task_retry_limit: 0
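You can then apply this configuration with gcloud app deploy queue.yaml (assuming you manage your queues with queue.yaml).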

Related

requests hang unless timeout argument given

My backend runs on OpenShift and makes GET requests to other OpenShift clusters via the Kubernetes Python client. I am having an issue where requests hang until the default timeout value is reached. I have done some tests in the pod to see if it can reach other OpenShift clusters and discovered the following:
requests.get("some_other_cluster_api_url") will hang and return correctly in 2 mins, but requests.get("some_other_cluster_api_url", timeout=1) returns correctly in 1 second. Why does the request not return immediately in the first case?
Edit: curl also instantly returns the right response
As per this Timeout Doc, in the first case you have not set any timeout value: by default, requests does not time out unless a timeout value is set explicitly, so without a timeout your code may hang for minutes or more.
In the second case, you have set the timeout value to 1, so the request waits at most about 1 second, which is why it returns so quickly.
For example, if you specify a single value for the timeout, like this:
r = requests.get('https://github.com', timeout=5)
then requests will stop waiting for a response after 5 seconds and raise a timeout error.
Refer to this Doc and this SO post for more information on using request timeouts.
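If it helps, here is a small sketch (the URL is just a placeholder) showing how to set separate connect and read timeouts and handle the timeout error explicitly:

import requests

try:
    # 3.05 seconds to establish the connection, 27 seconds to wait for data.
    r = requests.get('https://example.com/api', timeout=(3.05, 27))
    print(r.status_code)
except requests.exceptions.Timeout:
    print('Request timed out')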

Google Cloud Run not scaling up despite large backlog and available instances

I am seeing something similar to this post. It looked like additional detail was needed to answer that question, so I'm re-asking with my details since those details weren't provided.
I am running a modified version of the Google Cloud Run image processing tutorial example.
I am inserting tasks into a task queue using this create tasks snippet. The tasks from the queue get pushed to my cloud run instance.
The problem is it isn't scaling up and making it through my tasks in a timely manner.
My cloud run service configuration:
I have tried setting a minimum of both 0 and 50 instances
I have tried a maximum of 100 and 1000 instances
I have tried --concurrency=1 and 2, and 8
I have tried with --async and without --async
With 50 instances pre-allocated even with concurrency set to 1, I am typically seeing ~10 active container instances and ~40 idle container instances. I have ~30,000 tasks in the queue and it is getting through ~5 jobs/minute.
My tasks queue has the default settings. My containers aren't using a lot of cpu, but they are using a lot of memory.
A process takes about a minute to complete. I'm only running one process per container instance. What additional parameters should be set to get higher throughput?
Edit - adding additional logs
I enabled the logs for the queue, and I'm seeing errors for some of the jobs. The errors look like this:
{
  insertId: "<my_id>"
  jsonPayload: {
    #type: "type.googleapis.com/google.cloud.tasks.logging.v1.TaskActivityLog"
    attemptResponseLog: {
      attemptDuration: "19.453155s"
      dispatchCount: "1"
      maxAttempts: 0
      responseCount: "0"
      retryTime: "2021-10-20T22:45:51.559121Z"
      scheduleTime: "2021-10-20T16:42:20.848145Z"
      status: "UNAVAILABLE"
      targetAddress: "POST <my_url>"
      targetType: "HTTP"
    }
    task: "<my_task>"
  }
  logName: "<my_log_name>"
  receiveTimestamp: "2021-10-20T22:45:52.418715942Z"
  resource: {
    labels: {
      location: "us-central1"
      project_id: "<my_project>"
      queue_id: "<my-queue>"
      target_type: "HTTP"
    }
    type: "cloud_tasks_queue"
  }
  severity: "ERROR"
  timestamp: "2021-10-20T22:45:51.459232147Z"
}
I don't see errors in the cloud run logs.
Edit - Additional Debug Information
I tried to take the queue out of the equation to determine whether the problem is Cloud Run or the queue, so I used curl to POST to the URL directly. Some of the tasks ran successfully; for others I received an error. In the logs below, empty lines are successful:
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
This makes me think cloud run isn't handling all of the incoming requests.
Edit - task completion time test
I wanted to test whether the time it takes to complete a task causes any issues with Cloud Run and the queue scaling up and keeping up with the tasks.
In place of the task I actually want completed, I put a dummy task that just sleeps for n seconds and prints the task details to stdout (which I can read in the Cloud Run logs), as sketched below.
With n set to 0, 5, or 10 seconds I see the number of instances scale up and keep up with the tasks being added to the queue. With n set to 20 seconds or more I see that fewer Cloud Run instances are instantiated and items accumulate in the task queue. I also see more errors with the UNAVAILABLE status in my logs.
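The dummy handler is essentially just a sleep-and-print endpoint; a minimal sketch (assuming a Flask app on Cloud Run, with SLEEP_SECONDS as a hypothetical knob for n) looks like this:

import os
import time
from flask import Flask, request

app = Flask(__name__)

@app.route('/task_handler', methods=['POST'])
def task_handler():
    n = int(os.environ.get('SLEEP_SECONDS', '20'))  # the n varied between tests
    print('Handling task:', request.get_data(as_text=True))  # visible in the Cloud Run logs
    time.sleep(n)
    return 'done', 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))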
According to this post:
Cloud Run offers a longer request timeout duration of up to 60 minutes
So it seems that long-running tasks are expected to be supported. Is this a Google bug, or am I missing some parameter?
I do not think this is a Cloud Run Service problem. I think this is an issue with how you have Tasks setup.
The dates in the log entry look odd. Take a look at the receiveTimestamp and the scheduleTime. The task is scheduled for six hours before the receive time. Do you have a timezone problem?
According to the documentation, if the response_time is not set then the task was not attempted. It looks like you are scheduling tasks incorrectly and the tasks never run.
Search for the text "The status of a task attempt." in this link:
Types for Google Cloud Tasks
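If an incorrect scheduleTime is indeed the problem, a minimal sketch of creating a task with an explicit UTC schedule time (using the older tasks_v2beta3 client style shown earlier on this page; the project, queue and URL are placeholders) would look like this:

import datetime

from google.cloud import tasks_v2beta3
from google.protobuf import timestamp_pb2

client = tasks_v2beta3.CloudTasksClient()
parent = client.queue_path('my-project', 'us-central1', 'my-queue')  # placeholders

# Build the schedule time in UTC so scheduleTime and receiveTimestamp agree.
run_at = datetime.datetime.utcnow() + datetime.timedelta(seconds=10)
schedule_time = timestamp_pb2.Timestamp()
schedule_time.FromDatetime(run_at)

task = {
    'http_request': {
        'http_method': 'POST',
        'url': 'https://my-service-xyz-uc.a.run.app/process',  # placeholder Cloud Run URL
    },
    'schedule_time': schedule_time,
}
response = client.create_task(parent, task)
print('Created task {}'.format(response.name))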

Cloud Tasks Conditional Execution

I am using Cloud Tasks. I need to trigger the execution of Task C only when Task A and Task B have been completed successfully. So I need some way of reading / being notified of the statuses of Tasks triggered. But I see no way of doing this in GCP's documentation. Using Node.js SDK to create tasks and Cloud Functions as task handlers if at all that helps.
Edit:
As requested, here is more info on what we are doing:
Tasks 1 - 10 each make HTTP requests, fetch data, update individual collections in Firestore based on this data. These 10 tasks can run in parallel and in no particular order as they don't have any dependency on each other. All of these tasks are actually implemented inside GCF.
Task 11 actually depends on the Firestore collection data updated by Tasks 1 - 10. So it can only run after Tasks 1 - 10 are completed successfully.
We do issue a RunID as a common identifier to group a particular run of all tasks (1 - 11).
Cloud Tasks only triggers tasks; the only condition you can define is a schedule time. You have to code the check manually when Task C runs.
Here is an example of such a process:
Task A runs; at the end, it writes to Firestore that it has completed
Task B runs; at the end, it writes to Firestore that it has completed
Task C starts and checks in Firestore whether A and B have completed
If they haven't, the task exits with an error
If they have, it continues the process
You have to configure Task C's queue to retry the task in case of error, as sketched below.
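A minimal sketch of that pattern in Python (assuming a hypothetical task_status collection keyed by the RunID, and a Flask-style Cloud Functions handler; adapt it to the Node.js SDK as needed):

from google.cloud import firestore

db = firestore.Client()

def upstream_tasks_done(run_id, task_ids=('taskA', 'taskB')):
    # True only if every upstream task wrote a 'completed' flag for this run.
    doc = db.collection('task_status').document(run_id).get()
    status = doc.to_dict() or {}
    return all(status.get(task_id) == 'completed' for task_id in task_ids)

def task_c_handler(request):
    run_id = request.args['run_id']  # hypothetical way of passing the RunID
    if not upstream_tasks_done(run_id):
        # A non-2xx response makes Cloud Tasks retry Task C according to the
        # queue's retry configuration, effectively polling until A and B finish.
        return 'Upstream tasks not finished yet', 503
    # ... do the real work of Task C here ...
    return 'OK', 200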
Another, more expensive, solution is to use Cloud Composer to handle this workflow.
For now, there is no other solution for workflow management.
Cloud Tasks is not the tool you want to use in this case. Take a look at Cloud Composer, which is built on top of Apache Airflow for GCP.
Edit: You could create a GCF to handle the states of those requests:
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

################ TASK A
taskA_list = [
    "https://via.placeholder.com/400",
    "https://via.placeholder.com/410",
    "https://via.placeholder.com/420",
    "https://via.placeholder.com/430",
    "https://via.placeholder.com/440",
    "https://via.placeholder.com/450",
    "https://via.placeholder.com/460",
    "https://via.placeholder.com/470",
    "https://via.placeholder.com/480",
    "https://via.placeholder.com/490",
]

def call2TaskA(url):
    html = requests.get(url, stream=True)
    return (url, html.status_code)

processes = []
results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for url in taskA_list:
        processes.append(executor.submit(call2TaskA, url))

isOkayToDoTaskB = True
for taskA in as_completed(processes):
    result = taskA.result()
    if result[1] != 200:  # your validation on taskA
        isOkayToDoTaskB = False
    results.append(result)

if not isOkayToDoTaskB:
    raise ValueError('Problems: {}'.format(results))

################ TASK B
def doTaskB():
    pass

doTaskB()

How to retrieve current workers count for job in GCP dataflow using API

Does anyone know if there is a possibility to get current workers count for active job that is running in GCP Dataflow?
I wasn't able to do it using the API provided by Google.
One thing that I was able to get is CurrentVcpuCount but it is not what I need.
Thanks in advance!
The current number of workers in a Dataflow job is displayed in the message logs, under autoscaling. As an example, I ran a quick job and got the following messages when displaying the job logs in my Cloud Shell:
INFO:root:2019-01-28T16:42:33.173Z: JOB_MESSAGE_DETAILED: Autoscaling: Raised the number of workers to 0 based on the rate of progress in the currently running step(s).
INFO:root:2019-01-28T16:43:02.166Z: JOB_MESSAGE_DETAILED: Autoscaling: Raised the number of workers to 1 based on the rate of progress in the currently running step(s).
INFO:root:2019-01-28T16:43:05.385Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
INFO:root:2019-01-28T16:43:05.433Z: JOB_MESSAGE_DETAILED: Workers have started successfully.
Now, you can query these messages by using the projects.jobs.messages.list method in the Dataflow API and setting the minimumImportance parameter to JOB_MESSAGE_BASIC.
You will get a response similar to the following:
...
"autoscalingEvents": [
    {...}  // other events
    {
        "currentNumWorkers": "1",
        "eventType": "CURRENT_NUM_WORKERS_CHANGED",
        "description": {
            "messageText": "(fcfef6769cff802b): Worker pool started.",
            "messageKey": "POOL_STARTUP_COMPLETED"
        },
        "time": "2019-01-28T16:43:02.130129051Z",
        "workerPool": "Regular"
    },
To extend this, you could write a Python script that parses the response and reads the currentNumWorkers parameter from the last element of the autoscalingEvents list, which gives the latest (hence current) number of workers in the job.
Note that if this parameter is not present, it means that the number of workers is zero.
Edit:
I did a quick python script that retrieves the current number of workers, from the message logs, using the API I mentioned above:
from google.oauth2 import service_account
import googleapiclient.discovery

credentials = service_account.Credentials.from_service_account_file(
    filename='PATH-TO-SERVICE-ACCOUNT-KEY/key.json',
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

service = googleapiclient.discovery.build(
    'dataflow', 'v1b3', credentials=credentials)

project_id = "MY-PROJECT-ID"
job_id = "DATAFLOW-JOB-ID"

messages = service.projects().jobs().messages().list(
    projectId=project_id,
    jobId=job_id
).execute()

try:
    print("Current number of workers is " + messages['autoscalingEvents'][-1]['currentNumWorkers'])
except:
    print("Current number of workers is 0")
A couple of notes:
The scopes are the permissions needed on the service account key you are referencing (in the from_service_account_file function) in order to call the API. This line is needed to authenticate to the API. You can use any one of this list; to make it easy on my side, I just used a service account key with project/owner permissions.
If you want to read more about the Python API Client Libraries, check this documentation and these samples.

Elastic Beanstalk Workers and the VisibilityTimeout period

According to http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html:
If the application returns any response other than 200 OK, then Elastic
Beanstalk waits to put the message back in the queue after the configured
VisibilityTimeout period.
I have set the VisibilityTimeout to 1 minute. My app is returning a 400 error when processing the request. I see from the logs that the request is being re-tried every 2 seconds! I was expecting, based on the above, for it to retry every 60 seconds.
What am I missing?
This might not be an issue with the SQS queue at all. It is true that the message is returned to the queue only after the specified VisibilityTimeout, but it depends on how you are polling the messages.
If you do not access the queue directly (but use some kind of service to do it for you), you have another layer of complexity there.
There's a worker process in Elastic Beanstalk called sqsd that does the polling (processing the messages and deleting them from the queue once you respond with 200).
sqsd uses a similar concept called InactivityTimeout: it specifies how long this worker waits for the 200 response, and it resends the message after this time if no such response is delivered.
My guess is that the cause of your problem is this InactivityTimeout.
If this is not the cause, try looking into the WaitTimeSeconds parameter of your SQS queue. It controls long polling: a ReceiveMessage call returns as soon as messages are available, and otherwise waits up to the specified time.
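To rule that part out, here is a small sketch (using boto3; the queue URL is a placeholder) that checks what is actually configured on the queue. Note that the queue-level counterpart of WaitTimeSeconds is the ReceiveMessageWaitTimeSeconds attribute:

import boto3

sqs = boto3.client('sqs')
attrs = sqs.get_queue_attributes(
    QueueUrl='https://sqs.us-east-1.amazonaws.com/123456789012/my-worker-queue',  # placeholder
    AttributeNames=['VisibilityTimeout', 'ReceiveMessageWaitTimeSeconds'])
# e.g. {'VisibilityTimeout': '60', 'ReceiveMessageWaitTimeSeconds': '0'}
print(attrs['Attributes'])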
I had a similar issue with an EC2 instance where I had specified all the timeouts. In the end it turned out to be caused by a bug in Tomcat; see this: https://forums.aws.amazon.com/thread.jspa?threadID=183473